JP3998664B2

JP3998664B2 - Parallel word classification system, parallel word classification method, and parallel word classification program

Info

Publication number: JP3998664B2
Application number: JP2004171622A
Authority: JP
Inventors: 美樹佐々木; 稔樹村田
Original assignee: Oki Electric Industry Co Ltd
Current assignee: Oki Electric Industry Co Ltd
Priority date: 2004-06-09
Filing date: 2004-06-09
Publication date: 2007-10-31
Anticipated expiration: 2024-06-09
Also published as: JP2005352676A

Description

The present invention relates to a bilingual phrase classification system, a bilingual phrase classification method, and a bilingual phrase classification program, and, for example, to an apparatus, a method, and a program that automatically determine and classify the field of a word in natural language processing.

In natural language processing, words must be continually registered in dictionaries to keep up with the new words that emerge every day. When words are classified into a dictionary by hand, it is difficult to keep the classification appropriate and consistent.

Such a dictionary may be used as a specialized dictionary, for example in machine translation. To serve that purpose, it is effective not merely to classify words but to organize the fields hierarchically and to classify each word into the field with the narrowest possible scope, that is, into the lowest possible level. In that case, however, it is difficult to decide which level is the most appropriate for a word. Moreover, because the translation of a word must also be taken into account when classifying it, deciding the level is harder still.

Non-Patent Document 1 describes a technique that classifies unregistered noun phrases collected during translation into field-specific dictionaries organized in two levels, called main fields and sub-fields, based on how many times each phrase matched the patterns contained in the field-specific dictionaries.

In Non-Patent Document 1, a calculation is first performed over all sub-fields, and phrases that are unique to a sub-field and occur frequently are classified into the sub-field dictionaries. After those phrases are removed, the same calculation is performed over all main fields, and the remaining phrases are classified into the main-field dictionaries. The document states that this calculation treats each phrase as a vector in a vector space whose elements are the fields. In addition, because registering words increases the vocabulary of the field-specific dictionaries, the accuracy of field estimation can be improved.

Non-Patent Document 1 also states that, as for the translation of a word, either a system that does not know the translation is made to generate one, or translations are classified manually.
Non-Patent Document 1: Yoshiro Kamiyama (神山淑朗) and Harumi Ito (伊藤晴美), "Machine Translation System for Autonomous Vocabulary Expansion" (自律的語彙拡充を行う機械翻訳システム), IPSJ 65th National Convention, 2003, pp. 2-5 to 2-6

However, in the prior art described in Non-Patent Document 1, even if a translation is supplied manually, the meaning of the translation cannot be taken into account at field-estimation time. Phrases that should go to different fields depending on their translation therefore cannot be distinguished during field estimation, and they may be classified incorrectly.

Furthermore, because field estimation depends on the contents of the field-specific dictionaries, once incorrectly judged phrases are registered and the vocabulary grows, subsequent field estimation may also be incorrect. As more such phrases are registered, the quality of the field-specific dictionaries may deteriorate.

In addition, Non-Patent Document 1 requires the calculation to be repeated for each level of the hierarchy. At the same time, deeper hierarchies are desirable, for example to improve machine translation quality. As the hierarchy deepens, the number of calculations grows with the number of levels. Because the calculation also depends on the words registered in the field-specific dictionaries, the total amount of computation becomes enormous as the number of registered words increases.

Accordingly, to solve the above problems, there is a need for a bilingual phrase classification system, a bilingual phrase classification method, and a bilingual phrase classification program that determine the categories of a source word to be determined and of its translated word, and that classify words efficiently while taking into account the polysemy of translations of the source word.

To solve this problem, the bilingual phrase classification system according to the first aspect of the present invention determines the respective categories of the source word and the translated word of an input determination target word and classifies the determination target word by category, and is characterized by comprising: (1) category representative word creating means for creating one or more category representative words that characteristically express a category, and for creating a field relevance indicating how characteristic each category representative word is within documents classified by category; (2) category representative word holding means for holding each category representative word created by the category representative word creating means together with its field relevance; (3) source word determination means for extracting, from source-language unclassified documents, one or more words that co-occur with the determination target source word, retrieving from the category representative word holding means the field relevance of those extracted words that are category representative words, and creating a field determination degree list for the determination target source word; (4) translated word determination means for extracting, from translated-language unclassified documents, one or more words that co-occur with the determination target translated word, finding one or more words that are source-language translations of each extracted word, retrieving from the category representative word holding means the field relevance of those source-language translations that are category representative words, and creating a field determination degree list for the determination target translated word; (5) bilingual determination means for integrating the field determination degrees of the category representative words in the field determination degree list of the determination target source word and in that of the determination target translated word, and creating an integrated field determination degree for each category for the determination target word; and (6) output means for outputting category classification information for the determination target word based on the integrated field determination degrees for each category created by the bilingual determination means.

The bilingual phrase classification method of the second aspect of the present invention corresponds to the bilingual phrase classification system of the first aspect. That is, it is a bilingual phrase classification method that determines the respective categories of the source word and the translated word of an input determination target word and classifies the determination target word by category, characterized in that: (1) category representative word creating means creates one or more category representative words that characteristically express a category, and creates a field relevance indicating how characteristic each category representative word is within documents classified by category; (2) category representative word holding means holds each category representative word created by the category representative word creating means together with its field relevance; (3) source word determination means extracts, from source-language unclassified documents, one or more words that co-occur with the determination target source word, retrieves from the category representative word holding means the field relevance of those extracted words that are category representative words, and creates a field determination degree list for the determination target source word; (4) translated word determination means extracts, from translated-language unclassified documents, one or more words that co-occur with the determination target translated word, finds one or more words that are source-language translations of each extracted word, retrieves from the category representative word holding means the field relevance of those source-language translations that are category representative words, and creates a field determination degree list for the determination target translated word; (5) bilingual determination means integrates the field determination degrees of the category representative words in the field determination degree list of the determination target source word and in that of the determination target translated word, and creates an integrated field determination degree for each category for the determination target word; and (6) output means outputs category classification information for the determination target word based on the integrated field determination degrees for each category created by the bilingual determination means.

The bilingual phrase classification program of the third aspect of the present invention corresponds to the bilingual phrase classification system of the first aspect. That is, the program causes a bilingual phrase classification apparatus, which determines the respective categories of the source word and the translated word of an input determination target word and classifies the determination target word by category, to realize: (1) category representative word creating means for creating one or more category representative words that characteristically express a category, and for creating a field relevance indicating how characteristic each category representative word is within documents classified by category; (2) category representative word holding means for holding each category representative word created by the category representative word creating means together with its field relevance; (3) source word determination means for extracting, from source-language unclassified documents, one or more words that co-occur with the determination target source word, retrieving from the category representative word holding means the field relevance of those extracted words that are category representative words, and creating a field determination degree list for the determination target source word; (4) translated word determination means for extracting, from translated-language unclassified documents, one or more words that co-occur with the determination target translated word, finding one or more words that are source-language translations of each extracted word, retrieving from the category representative word holding means the field relevance of those source-language translations that are category representative words, and creating a field determination degree list for the determination target translated word; (5) bilingual determination means for integrating the field determination degrees of the category representative words in the field determination degree list of the determination target source word and in that of the determination target translated word, and creating an integrated field determination degree for each category for the determination target word; and (6) output means for outputting category classification information for the determination target word based on the integrated field determination degrees for each category created by the bilingual determination means.

According to the bilingual phrase classification system, bilingual phrase classification method, and bilingual phrase classification program of the present invention, bilingual words and phrases can be classified with the polysemy of source words and translated words resolved, regardless of how complete the dictionary contents are and regardless of the depth of the category hierarchy.

The best mode for carrying out the bilingual phrase classification system, bilingual phrase classification method, and bilingual phrase classification program of the present invention is described below with reference to the drawings.

The embodiment described below applies the bilingual phrase classification system, bilingual phrase classification method, and bilingual phrase classification program of the present invention to a bilingual phrase classification system that determines the field of bilingual phrases and classifies them into a dictionary.

In the following embodiment, words that are characteristic and representative of each field are prepared on the source-language side. The field of the source word is determined using those words; the field of the translated word is determined using those words together with a word dictionary from the translated language to the source language; and the two determination results are integrated. This makes it possible to classify bilingual words and phrases with the polysemy of source words and translated words resolved, regardless of how complete the dictionary contents are and regardless of the depth of the hierarchy.

(A) First Embodiment
A first embodiment of the bilingual phrase classification system, bilingual phrase classification method, and bilingual phrase classification program of the present invention is described below with reference to the drawings.

In the following description, a word that is characteristic and representative of a category (field) is defined as a core word. The simultaneous appearance of multiple elements within a given range (for example, a phrase, sentence, paragraph, or passage) is called co-occurrence, and words or phrases that appear together are said to be in a co-occurrence relationship. A word (or phrase) that co-occurs with a core word is defined as belonging to the same category as that core word. In the example, the words that appear together in one sentence, delimited by a line break, are treated as co-occurring words after unnecessary words are removed. Every word other than a noun, verb, adjective, adjectival verb (形容動詞), or unknown word is treated as an unnecessary word.

To extract the words that co-occur with a given word, the sentences containing that word are retrieved, each sentence up to a line break is extracted, and morphological analysis is applied to it.
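By way of a non-limiting sketch, the extraction of co-occurring words described above might look as follows in Python. The function names, the part-of-speech labels, and the whitespace tokenizer `toy_analyze` are hypothetical and are not part of the specification; for Japanese text a real morphological analyzer would take the place of `toy_analyze`. Excluding the target word from its own co-occurrence list is likewise an assumption.

```python
from collections import Counter

# Parts of speech kept as content words; everything else is an "unnecessary word".
CONTENT_POS = {"noun", "verb", "adjective", "adjectival_verb", "unknown"}


def toy_analyze(sentence, pos_lexicon):
    """Hypothetical stand-in for a morphological analyzer.

    Splits on whitespace and tags each token from a lookup table;
    tokens missing from the table are tagged "unknown".
    """
    return [(token, pos_lexicon.get(token, "unknown")) for token in sentence.split()]


def cooccurring_words(target, text, pos_lexicon):
    """Count content words that share a line (one sentence) with `target`."""
    counts = Counter()
    for sentence in text.splitlines():
        tokens = toy_analyze(sentence, pos_lexicon)
        if target not in [surface for surface, _ in tokens]:
            continue  # only sentences containing the target word are used
        for surface, pos in tokens:
            # Assumption: the target is not counted as its own co-occurring word.
            if surface != target and pos in CONTENT_POS:
                counts[surface] += 1
    return counts
```

The returned counts correspond to the number of occurrences used later as a weight when field determination degrees are calculated.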

Each core word is assigned a value indicating the degree to which it belongs to its category. This value is defined as the field relevance. The larger the field relevance, the more strongly the core word belongs to the category.

A bilingual pair consists of a source word and a translated word. The source word whose field is to be determined is defined as the determination target source word, the translated word whose field is to be determined as the determination target translated word, and the bilingual word pair whose field is to be determined as the determination target word. In the example, the source language is Japanese and the translated language is English.

For a determination target source word, a determination target translated word, or a determination target word, the value of a core word that indicates the degree of belonging to a category is defined as the field determination degree, and the triples of category, core word, and field determination degree are defined as the field determination degree list.

For a determination target word, the value that indicates the degree of belonging to a category is defined as the integrated field determination degree, and the pairs of category and integrated field determination degree are defined as the integrated field determination degree list.

(A-1) Configuration of the First Embodiment
FIG. 1 is a block diagram showing the functional configuration of the bilingual phrase classification system of this embodiment. The bilingual phrase classification system 100 of the first embodiment is realized, for example, by installing the bilingual phrase classification program on an information processing apparatus such as a personal computer equipped with input/output means; functionally, it can be represented as shown in FIG. 1.

In FIG. 1, the bilingual phrase classification system 100 of the first embodiment has input means 1, core word creation means 2, source word determination means 3, translated word determination means 4, bilingual determination means 5, output means 6, source-language classified documents 7, a core word dictionary 8, source-language unclassified documents 9, translated-language unclassified documents 10, a word dictionary 11, and a category dictionary 12.

For the general processing described below, such as morphological analysis, search, extraction, database processing, and storage, known natural language processing techniques can be used.

The input means 1 is a means for inputting documents and words to be classified, and also specifies operation modes and the like as appropriate. It covers not only general input devices such as a keyboard but also file reading devices such as a recording-medium access device, and character recognition devices that read a document as image data and convert it to text data.

The core word creation means 2 creates the source-language core words and their field relevance. It performs morphological analysis on the documents in the source-language classified documents 7 category by category, extracts the words that remain after unnecessary words are removed from the analysis results, and creates core words for each category. It also calculates the field relevance and stores the created core words and their field relevance in the core word dictionary 8.

As described above, the field relevance may be any value that indicates the degree to which a core word belongs to a category (how well it expresses the characteristics of that category), so it can be obtained in various ways. In this embodiment, the field relevance is calculated using values computed by the tf*idf method.

In the tf*idf method, the occurrence frequency tf (term frequency) of a word t (here, a core word) in a document d within a document set and the number of documents df (document frequency) in which the word t occurs at least once are computed, and idf (inverse document frequency) is obtained from df by equation (1) below.

idf(t) = log(N / df(t)) ... (1)
Here, N is the number of documents.

The component w(t, d) that determines the field relevance is defined from this idf and tf by the following equation.

w(t, d) = tf(d, t) * idf(t) ... (2)
tf is used because a word that occurs repeatedly in a document is considered more important in that document. idf indicates how well the word (here, a core word) identifies the document within the document set: idf is small for a common word that appears in many documents of the set and, conversely, large for a word that appears only in particular documents.
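As a minimal sketch under stated assumptions, equations (1) and (2) can be computed as follows. Each document is assumed to be a list of already-extracted words, the function names are hypothetical, and the natural logarithm is used because the specification does not state the base of the logarithm.

```python
import math
from collections import Counter


def idf(term, documents):
    """Equation (1): idf(t) = log(N / df(t)), with N the number of documents."""
    n_docs = len(documents)
    # df(t): number of documents in which the term occurs at least once.
    df = sum(1 for doc in documents if term in doc)
    if df == 0:
        raise ValueError(f"{term!r} does not occur in any document")
    return math.log(n_docs / df)


def tf_idf(term, doc, documents):
    """Equation (2): w(t, d) = tf(d, t) * idf(t)."""
    tf = Counter(doc)[term]  # occurrences of the term in this document
    return tf * idf(term, documents)
```

A word that occurs in every document gets idf = log(1) = 0 and therefore a weight of zero, which matches the statement that common words have a small idf.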

Because tf*idf can take larger values as the total number of words grows, adjustment between categories is necessary. Therefore, for each category, the tf*idf value divided by the total number of core words in that category is used as the field relevance.

field relevance (category, core word)
 = tf*idf / total number of core words in the category ... (3)

The source word determination means 3 creates the field determination degree list for the determination target source word. It takes in the input determination target source word, searches the source-language documents in the source-language unclassified documents 9 for it, and extracts from those documents the words that co-occur with it. It then looks up the extracted co-occurring words in the core word dictionary 8, calculates field determination degrees by weighting the field relevance of each core word found by its number of occurrences, and creates the field determination degree list for the determination target source word.
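The per-category normalization of equation (3) can be sketched as follows. This is only an illustration: the dictionary layout and the function name are hypothetical, and because the specification does not say whether the "total number of core words" counts distinct core words or all their occurrences, the totals are left to the caller.

```python
def field_relevance(tfidf_by_category, core_word_totals):
    """Equation (3): divide each tf*idf value by its category's core word total.

    tfidf_by_category: {category: {core_word: tf*idf value}}
    core_word_totals:  {category: total number of core words in that category}
    """
    return {
        category: {
            word: weight / core_word_totals[category]
            for word, weight in weights.items()
        }
        for category, weights in tfidf_by_category.items()
    }
```

The result has the same shape as the contents of the core word dictionary 8: a field relevance for every (category, core word) pair.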

field determination degree of the determination target source word (determination target source word, category, core word)
 = field relevance (category, core word)
 × number of occurrences (determination target source word, core word) ... (4)

The translated word determination means 4 creates the field determination degree list for the determination target translated word. It searches the translated-language documents in the translated-language unclassified documents 10 for the input determination target translated word and extracts from them the translated-language words that co-occur with it. It then looks up the source-language translations of the extracted co-occurring words in the word dictionary 11, performs morphological analysis on those source-language translations, extracts the words that remain after unnecessary words are removed, and looks those words up in the core word dictionary 8. It calculates field determination degrees by weighting the field relevance of each core word found by the number of occurrences and by the number of source-language translations, and outputs the field determination degree list for the determination target translated word.

The source-language translations are morphologically analyzed and stripped of unnecessary words in this way so that they can be matched against the core words.

field determination degree of the determination target translated word (determination target translated word, category, core word)
 = field relevance (category, core word)
 × number of occurrences (determination target translated word, core word)
 / number of source-language translations (determination target translated word) ... (5)

As equation (5) shows, the weighting by the number of source-language translations divides the field relevance of a matching word by the number of source-language translations of the translated word. This accommodates the polysemy of translated words: it cancels the effect that a translated word having multiple source-language translations would otherwise have on the number of occurrences.
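Equations (4) and (5) differ only in the divisor, so a single hypothetical helper can illustrate both. In this sketch the core word dictionary is assumed to be a nested dictionary of field relevance values, the occurrence counts are assumed to be given per source-language word, and the number of source-language translations is passed in as one number, following the way equation (5) writes it.

```python
def determination_list(occurrences, core_words, n_source_translations=1):
    """Build a field determination degree list.

    occurrences: {source-language word: number of occurrences}
    core_words:  {category: {core_word: field relevance}}
    n_source_translations: 1 reproduces equation (4); the number of
        source-language translations reproduces equation (5).
    Returns {(category, core_word): field determination degree}.
    """
    if n_source_translations <= 0:
        raise ValueError("n_source_translations must be positive")
    result = {}
    for category, relevances in core_words.items():
        for word, relevance in relevances.items():
            count = occurrences.get(word, 0)
            if count:  # only co-occurring words that are core words contribute
                result[(category, word)] = relevance * count / n_source_translations
    return result
```

Words that co-occur with the target but are not core words are simply ignored, which mirrors the lookup in the core word dictionary 8.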

The bilingual determination means 5 calculates the integrated field determination degree of the determination target word. It takes in the field determination degree list of the determination target source word created by the source word determination means 3 and the field determination degree list of the determination target translated word created by the translated word determination means 4, integrates them to calculate the field determination degrees of the determination target word, and creates the field determination degree list of the determination target word. From that list it calculates the integrated field determination degree for each category of the determination target word and stores the integrated field determination degree for each category in the category dictionary 12. It also passes the category to which the determination target word belongs to the output means 6 for output. The category output by the output means is the category with the largest integrated field determination degree.

The field determination degree of the determination target word is calculated as follows.

It is desirable that the field determination degree of the determination target word be higher the higher the field determination degrees on both the source-word side and the translated-word side are. However, the field determination degrees of the two sides cannot simply be compared, because the documents from which co-occurring words are extracted differ in content and quantity and because co-occurring words may be polysemous. Therefore, the field determination degree of the determination target source word and that of the determination target translated word are each normalized by dividing by the respective maximum field determination degree, and their harmonic mean is taken.

判定対象語の分野判定度（判定対象語，カテゴリ，コアワード）
＝２×（Ｖ１／Ｍ１）×（Ｖ２／Ｍ２）
／（（Ｖ１／Ｍ１）＋（Ｖ２／Ｍ２）） …（６）
ここで、Ｖ１は、判定対象原語の分野判定度（判定対象原語，カテゴリ，コアワード）、Ｍ１は、判定対象原語の分野判定度（判定対象原語，カテゴリ，コアワード）の最大値、Ｖ２は、判定対象訳語の分野判定度（判定対象訳語，カテゴリ，コアワード）、Ｍ２は、判定対象訳語の分野判定度（判定対象訳語，カテゴリ，コアワード）の最大値である。

Field determination degree of the determination target word (determination target word, category, core word)
= 2 × (V1 / M1) × (V2 / M2)
/ ((V1 / M1) + (V2 / M2)) … (6)
Here, V1 is the field determination degree of the determination target source word (determination target source word, category, core word), M1 is the maximum value of the field determination degree of the determination target source word (determination target source word, category, core word), V2 is the field determination degree of the determination target translated word (determination target translated word, category, core word), and M2 is the maximum value of the field determination degree of the determination target translated word (determination target translated word, category, core word).
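
Equation (6) can be sketched in a few lines of Python. This is only an illustration: the function name `combined_field_degree` and the decision to return 0.0 when a maximum is zero are assumptions, not part of the specification. The sample values are the ones used in the worked example of section (A-2-8).

```python
def combined_field_degree(v1: float, m1: float, v2: float, m2: float) -> float:
    """Harmonic mean of the max-normalized source and translation degrees (equation (6)).

    v1, m1: source-side field determination degree and its maximum.
    v2, m2: translation-side field determination degree and its maximum.
    """
    # Assumption: if either side has no positive maximum there is nothing to
    # normalize by, so the pair contributes 0.0.
    if m1 <= 0 or m2 <= 0:
        return 0.0
    a = v1 / m1  # normalized source-side degree
    b = v2 / m2  # normalized translation-side degree
    if a + b == 0:
        return 0.0
    return 2 * a * b / (a + b)


# Values from the worked example: V1 = 1.875, M1 = 2.581, V2 = 0.00481, M2 = 0.01076
worked_example = combined_field_degree(1.875, 2.581, 0.00481, 0.01076)
print(round(worked_example, 3))  # prints 0.553
```

Because the harmonic mean is dominated by the smaller of the two normalized values, a pair scores high only when both the source side and the translation side support the same category and core word.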

また、判定対象語の統合分野判定度を計算する方法は、分野判定度がある閾値以上ならばカテゴリ毎に分野判定度を合計することにする。閾値で足切りするのは、分野判定度が低いものが数で勝る悪影響を排除するためである。閾値は、カテゴリを最適に決定することができる任意の値である。なお、本実施形態では、実験によって求めた最適値で説明する。 The integrated field determination degree of the determination target word is calculated by summing, for each category, the field determination degrees that are equal to or greater than a certain threshold. The threshold cutoff prevents entries with low field determination degrees from dominating the total by sheer number. The threshold may be any value that allows the category to be determined optimally. This embodiment is described using an optimal value obtained by experiment.

判定対象語の統合分野判定度（カテゴリ）
＝Σ判定対象語の分野判定度（カテゴリ，コアワード） …（７）
ただし、判定対象語の分野判定度≧α（αは閾値）。

Integrated field determination degree of the determination target word (category)
= Σ field determination degree of the determination target word (category, core word) … (7)
where the sum is taken only over field determination degrees of the determination target word that satisfy degree ≥ α (α is the threshold).
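
The thresholded sum of equation (7) might look like the following sketch. The function name `integrated_degrees`, the tuple layout of the entries, and the sample numbers are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def integrated_degrees(entries, alpha):
    """Sum field determination degrees per category, keeping only degrees >= alpha.

    entries: iterable of (category, core_word, degree) tuples (assumed layout).
    alpha:   threshold from equation (7).
    """
    totals = defaultdict(float)
    for category, _core_word, degree in entries:
        if degree >= alpha:  # low-scoring entries are cut off
            totals[category] += degree
    return dict(totals)


# Hypothetical entries, not taken from the figures of the specification.
sample_entries = [
    ("politics", "election", 0.5),
    ("politics", "parliament", 0.25),
    ("politics", "vote", 0.01),   # below the threshold, ignored
    ("baseball", "umpire", 0.3),
]
sample_totals = integrated_degrees(sample_entries, alpha=0.02)
print(sample_totals)  # prints {'politics': 0.75, 'baseball': 0.3}
```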

出力手段６は、入力した語のカテゴリを出力する手段である。出力手段６は、例えば、ディスプレイやプリンタ等の一般的な出力手段だけでなく、記録媒体へ格納する記憶媒体アクセス装置等もこの出力手段６に該当し得る。また、出力手段６は、適宜、動作モードによる動作処理の経過、結果などを出力し得る。 The output means 6 is means for outputting the category of the input word. The output means 6 can correspond to not only general output means such as a display and a printer, but also a storage medium access device for storing in a recording medium. Further, the output means 6 can appropriately output the progress of the operation process in the operation mode, the result, and the like.

原語分類済文書７は、カテゴリに分類された原語文書を格納するデータベースである。 The source language classified document 7 is a database that stores source language documents classified into categories.

コアワード辞書８は、コアワードの情報を格納する辞書である。コアワード辞書８は、カテゴリとコアワードと分野関連度との組を記述する。 The core word dictionary 8 is a dictionary that stores core word information. The core word dictionary 8 describes pairs of categories, core words, and field relevance.

原語未分類文書９は、未分類の原語文書を格納するデータベースである。 The original language unclassified document 9 is a database that stores uncategorized original language documents.

訳語未分類文書１０は、未分類の訳語文書を格納するデータベースである。 The translated word uncategorized document 10 is a database that stores unclassified translated word documents.

単語辞書１１は、訳語の原語訳の情報を格納する辞書である。単語辞書１１は、１つの訳語に対して１つ以上の原語訳を記述する。図９は、単語辞書１１が格納する情報例である。図９に示すように、単語辞書１１は、訳語、品詞、原語訳を項目例として単語情報を格納する。 The word dictionary 11 is a dictionary that stores information on the original translation of the translated word. The word dictionary 11 describes one or more source language translations for one translated word. FIG. 9 shows an example of information stored in the word dictionary 11. As shown in FIG. 9, the word dictionary 11 stores word information with translated words, parts of speech, and source language translations as item examples.

カテゴリ辞書１２は、判定対象語のカテゴリの情報を格納する辞書である。カテゴリ辞書１２は、判定対象語に対してカテゴリと統合分野判定度を記述する。 The category dictionary 12 is a dictionary that stores information on categories of determination target words. The category dictionary 12 describes the category and the integrated field determination degree for the determination target word.

（Ａ−２）第１の実施形態の動作 (A-2) Operation of First Embodiment
次に、本実施形態の対訳語句の分類システム１００の動作について図面を参照して説明する。 Next, the operation of the bilingual phrase classification system 100 of this embodiment will be described with reference to the drawings.

本実施形態の対訳語句分類システム１００が、分野判定を行なうためには、カテゴリ毎に分類された原語文書を、事前にカテゴリ別の原語分類済文書７に格納し、また予めコアワードと分野関連度を作成しておく必要がある。そして、各種データベース及び各種辞書を利用して分野判定を行なう。 For the bilingual phrase classification system 100 of this embodiment to perform field determination, source language documents classified by category must be stored in advance in the per-category source language classified documents 7, and the core words and their field relevance degrees must be created beforehand. Field determination is then performed using the various databases and dictionaries.

以下では、本実施形態の対訳分類システム１００の全体動作を説明した後に、具体的な各種処理動作について説明する。 In the following, after describing the overall operation of the bilingual classification system 100 of the present embodiment, specific various processing operations will be described.

（Ａ−２−１）全体動作 (A-2-1) Overall Operation
図２は、対訳語句分類システム１００の全体動作を説明するフローチャートである。 FIG. 2 is a flowchart for explaining the overall operation of the bilingual phrase classification system 100.

対訳分類システム１００は、ユーザにより入力手段１が操作されることで、各種処理が選択され、その選択された処理を動作する。 In the bilingual classification system 100, when the input unit 1 is operated by the user, various processes are selected, and the selected processes are operated.

例えば、図２に示すように、文書格納処理、コアワード作成処理、対訳分野判定処理又は終了（ＥＮＤ）の各種処理のいずれかが選択されることを受け付け（ステップ２０１）、ユーザの操作を入力手段１が取り込み、その操作に応じて各種処理を実行させる（ステップ２０２〜ステップ２０４）。 For example, as shown in FIG. 2, the system accepts the selection of one of the document storage process, the core word creation process, the bilingual field determination process, or end (END) (step 201). The input means 1 takes in the user's operation, and the corresponding process is executed (steps 202 to 204).

（Ａ−２−２）文書格納処理 (A-2-2) Document Storage Process
図２における文書格納処理の動作について図３を参照して説明する。図３は、文書格納処理の動作を説明するフローチャートである。 The operation of the document storage process in FIG. 2 will be described with reference to FIG. 3. FIG. 3 is a flowchart for explaining the operation of the document storage process.

図２において、文書格納処理が選択されると（ステップ２０１及び２０２）、図３に示すように、ユーザによる処理の選択を受け付ける（ステップ２２１）。 In FIG. 2, when the document storage process is selected (Steps 201 and 202), as shown in FIG. 3, the selection of the process by the user is accepted (Step 221).

ユーザによる処理選択を入力手段１は取り込み、その選択に応じて、カテゴリを指定してカテゴリ別に分類済の原語文書を入力して原語分類済文書７にカテゴリ別に格納するか（ステップ２２２）、未分類の原語文書を入力して原語未分類文書９に格納するか（ステップ２２３）、未分類の訳語文書を入力して訳語未分類文書１０に格納するか（ステップ２２４）、又は終了（ＥＮＤ）するかを実行する。 The input means 1 takes in the user's selection and, according to it, performs one of the following: it accepts source language documents already classified under a specified category and stores them by category in the source language classified documents 7 (step 222); it accepts unclassified source language documents and stores them in the source language unclassified documents 9 (step 223); it accepts unclassified translated language documents and stores them in the translated language unclassified documents 10 (step 224); or it ends (END).

（Ａ−２−３）コアワード作成処理 (A-2-3) Core Word Creation Process
図２におけるコアワード作成処理の動作について図４を参照して説明する。図４は、コアワード作成処理の動作を説明するフローチャートである。 The operation of the core word creation process in FIG. 2 will be described with reference to FIG. 4. FIG. 4 is a flowchart for explaining the operation of the core word creation process.

図２において、コアワード作成処理が選択されると（ステップ２０１及び２０３）、図４の処理を実行する。 In FIG. 2, when the core word creation process is selected (steps 201 and 203), the process of FIG. 4 is executed.

図４において、原語分類済文書７に格納されている文書は、コアワード作成手段２により、カテゴリ別に形態素解析される（ステップ２３１）。 In FIG. 4, the documents stored in the source language classified documents 7 are morphologically analyzed category by category by the core word creation means 2 (step 231).

カテゴリ別に分類済の原語文書が形態素解析されると、コアワード作成手段２により、その形態素解析結果から不要語を除いた単語が抽出され、カテゴリ別のコアワードが作成される（ステップ２３２）。 When the original language document classified by category is subjected to morphological analysis, the core word creation means 2 extracts words excluding unnecessary words from the morphological analysis result, and creates core words for each category (step 232).

カテゴリ別のコアワードが作成されると、上述したように、上記式（１）〜（３）に従って分野関連度が計算され（ステップ２３３）、作成されたコアワードと計算された分野関連度とはコアワード辞書８に格納される（ステップ２３４）。 When the core word for each category is created, as described above, the field relevance is calculated according to the above formulas (1) to (3) (step 233), and the created core word and the calculated field relevance are the core words. It is stored in the dictionary 8 (step 234).

（Ａ−２−４）対訳分野判定処理 (A-2-4) Bilingual Field Determination Process
図２における対訳分野判定処理の動作について図５を参照して説明する。図５は、対訳分野判定処理の動作を説明するフローチャートである。 The operation of the bilingual field determination process in FIG. 2 will be described with reference to FIG. 5. FIG. 5 is a flowchart for explaining the operation of the bilingual field determination process.

図２において、対訳分野判定処理が選択されると（ステップ２０１及び２０４）、図５の処理を実行する。 In FIG. 2, when the bilingual field determination process is selected (steps 201 and 204), the process of FIG. 5 is executed.

この対訳分野判定処理では、次のことを考慮する。共起関係の強さを反映する重み付けとして、抽出した共起関係にある語の出現回数を、語が一致する分野関連度にかけることにする。訳語の多義に対応できるように訳語に対して複数の原語訳があることが出現回数に影響することを解消するために、原語訳の数の重み付けとして、語が一致する分野関連度を訳語に対する原語訳の数で割ることにする。分野関連度に重み付けをした値を、分野判定度とする。 This bilingual field determination process takes the following into account. As a weight reflecting the strength of the co-occurrence relationship, the field relevance degree of the matching core word is multiplied by the number of times the extracted co-occurring word appears. A translated word may have several source language translations so that its polysemy can be handled, and this affects the number of occurrences. To cancel that effect, the field relevance degree of the matching core word is divided by the number of source language translations of the translated word. The field relevance degree weighted in this way is called the field determination degree.

図５において、入力手段１は入力された判定対象語を取り込む（ステップ３０１）。 In FIG. 5, the input means 1 takes in the input determination target word (step 301).

そして、原語判定手段３により、原語未分類文書９に格納されている文書及びコアワード辞書８のコアワード及び分野関連度から、上述した上記式（４）に従って、判定対象原語の分野判定度が計算されて、判定対象原語の分野判定度のリストが作成される（ステップ３０２）。なお、この詳細な動作については図６を参照して後述する。 Then, the source language determination unit 3 calculates the field determination degree of the determination target original word from the document stored in the original word uncategorized document 9 and the core word and the field relevance level of the core word dictionary 8 according to the above-described equation (4). Thus, a list of field determination degrees of the determination target original language is created (step 302). This detailed operation will be described later with reference to FIG.

また、訳語判定手段４により、訳語未分類文書１０に格納されている文書及び単語辞書１１とコアワード辞書８を用いて、上述した上記式（５）に従って、判定対象訳語の分野判定度が計算され、判定対象訳語の分野判定度のリストが作成される（ステップ３０３）。なお、この詳細な動作については図７を参照して後述する。 Further, the translation determination unit 4 calculates the field determination degree of the determination target translation according to the above-described equation (5) using the document stored in the untranslated document 10 and the word dictionary 11 and the core word dictionary 8. Then, a list of field judgment degrees of judgment target translated words is created (step 303). This detailed operation will be described later with reference to FIG.

さらに、対訳判定手段５により、判定対象原語の分野判定度のリストと判定対象訳語の分野判定度のリストとから、判定対象語の分野判定度のリストが作成され、上記式（６）に従って、判定対象語のカテゴリ毎の統合分野判定度が計算され、カテゴリ辞書１２に格納される（ステップ３０４）。なお、この詳細な動作については図８を参照して後述する。 Further, the bilingual determination unit 5 creates a list of field determination degrees of the determination target words from the list of field determination degrees of the determination target original words and the list of field determination degrees of the determination target translation words, according to the above formula (6). The integrated field determination degree for each category of the determination target word is calculated and stored in the category dictionary 12 (step 304). This detailed operation will be described later with reference to FIG.

出力手段６は、分野判定された判定対象語が属するカテゴリを出力する（ステップ３０５）。 The output means 6 outputs the category to which the determination target word determined in the field belongs (step 305).

（Ａ−２−５）判定対象原語の分野判定度の計算処理 (A-2-5) Calculation of the Field Determination Degree of the Determination Target Source Word
図６は、判定対象原語の分野判定度を計算する処理動作を説明するフローチャートである。 FIG. 6 is a flowchart for explaining the processing operation for calculating the field determination degree of the determination target source word.

入力手段１は入力された判定対象原語を取り込む（ステップ３２１）。入力手段１が判定対象原語を取り込むと、原語判定手段３により、原語未分類文書９の原語文書から判定対象原語が検索される（ステップ３２２）。 The input means 1 takes in the input determination target original language (step 321). When the input unit 1 captures the determination target source language, the source language determination unit 3 retrieves the determination target source language from the source language document of the source language unclassified document 9 (step 322).

原語判定手段３により判定対象原語が検索されると、原語判定手段３により、検索された判定対象原語と共起関係にある語が原語文書から抽出される（ステップ３２３）。 When the determination target original word is retrieved by the original language determination unit 3, the source language determination unit 3 extracts a word having a co-occurrence relationship with the retrieved determination target original word from the original language document (step 323).

判定対象原語と共起関係にある語が原語文書から抽出されると、原語判定手段３により、その抽出された判定対象原語と共起関係にある語がコアワード辞書８から検索される（ステップ３２４）。 When a word having a co-occurrence relationship with the determination target original word is extracted from the original document, the original word determination unit 3 searches the core word dictionary 8 for a word having a co-occurrence relationship with the extracted determination target original word (step 324). ).

コアワード辞書８から判定対象原語と共起関係にある語が検索されると、原語判定手段３により、その検索されたコアワードの分野関連度が取り出され（ステップ３２５）、その取り出された分野関連度は、上記式（４）に従って、出現回数の重み付け（乗算）がなされ（ステップ３２６）、判定対象原語の分野判定度が計算され、判定対象原語の分野判定度のリストを出力する（ステップ３２７）。 When a word having a co-occurrence relation with the determination target original word is retrieved from the core word dictionary 8, the field relevance level of the retrieved core word is extracted by the original word determination means 3 (step 325), and the extracted field relevance level Is weighted (multiplied) by the number of appearances according to the above equation (4) (step 326), the field determination degree of the determination target original word is calculated, and a list of the field determination degrees of the determination target original word is output (step 327). .
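
Steps 321 to 327 can be sketched as follows. This is a minimal illustration under several assumptions that the passage does not fix: documents are given as already tokenized sentences, two words "co-occur" when they appear in the same sentence, and the core word dictionary is modeled as a mapping from (category, core word) to field relevance degree. The function name `source_field_degrees` and the sample data are hypothetical, and English glosses stand in for the Japanese words.

```python
from collections import Counter


def source_field_degrees(target, sentences, core_words):
    """Build the field determination degree list for a source-side target word.

    target:     determination target source word.
    sentences:  list of token lists from the unclassified source documents.
    core_words: {(category, core_word): field_relevance} (assumed layout).
    Returns {(category, core_word): relevance * co-occurrence count}.
    """
    # Steps 322-323: find sentences containing the target and count co-occurring words.
    counts = Counter()
    for tokens in sentences:
        if target in tokens:
            counts.update(t for t in tokens if t != target)
    # Steps 324-326: look each co-occurring word up among the core words and
    # weight its field relevance by the number of occurrences.
    return {
        (category, word): relevance * counts[word]
        for (category, word), relevance in core_words.items()
        if counts[word] > 0
    }


sample_sentences = [
    ["judgement", "election", "result"],
    ["election", "judgement"],
    ["baseball", "game"],  # does not contain the target word
]
sample_core_words = {
    ("politics", "election"): 0.00962,
    ("baseball", "baseball"): 0.01,
}
source_list = source_field_degrees("judgement", sample_sentences, sample_core_words)
```

With 195 co-occurrences instead of the two in this toy corpus, the same multiplication gives the 0.00962 × 195 product used in the worked example.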

（Ａ−２−６）判定対象訳語の分野判定度の計算処理 (A-2-6) Calculation of the Field Determination Degree of the Determination Target Translated Word
図７は、判定対象訳語の分野判定度を計算する処理動作を説明するフローチャートである。 FIG. 7 is a flowchart for explaining the processing operation for calculating the field determination degree of the determination target translated word.

入力手段１が入力された判定対象訳語を取り込む（ステップ３３１）。入力手段１が判定対象訳語を取り込むと、訳語判定手段４により、訳語未分類文書１０の訳語文書から判定対象訳語が検索される（ステップ３３２）。 The input means 1 takes in the input determination target translated word (step 331). The translated word determination means 4 then searches the translated language documents in the translated language unclassified documents 10 for the determination target translated word (step 332).

訳語判定手段４により判定対象訳語が検索されると、訳語判定手段４により、判定対象訳語と共起関係にある訳語が訳語文書から抽出される（ステップ３３３）。 Once the determination target translated word has been found, the translated word determination means 4 extracts from the translated language documents the translated words that co-occur with it (step 333).

判定対象訳語と共起関係にある訳語が抽出されると、訳語判定手段４により、その抽出された判定対象訳語と共起関係にある訳語の原語訳が単語辞書１１から検索される（ステップ３３４）。 When a translation having a co-occurrence relationship with the determination target translation is extracted, the translation determination unit 4 searches the word dictionary 11 for an original translation of the translation having a co-occurrence with the extracted determination target translation (step 334). ).

単語辞書１１から共起関係にある語の原語訳が検索されると、訳語判定手段４により、その検索された原語訳を形態素解析した結果から不要語を除いた単語が抽出される（ステップ３３５）。 When the source language translations of the co-occurring words have been retrieved from the word dictionary 11, the translated word determination means 4 morphologically analyzes the retrieved source language translations and extracts the words that remain after unnecessary words are removed (step 335).

訳語判定手段４により不要語を除いた単語が抽出されると、訳語判定手段４により、その抽出された語がコアワード辞書８から検索される（ステップ３３６）。 When the translated word determining unit 4 extracts a word excluding unnecessary words, the translated word determining unit 4 searches the extracted word from the core word dictionary 8 (step 336).

コアワード辞書８から検索されると、訳語判定手段４により、そのコアワードの分野関連度が取り出され（ステップ３３７）、上記式（５）に従って、その取り出された分野関連度に対して出現回数の重み付け（乗算）がなされ（ステップ３３８）、さらに原語訳の数の重み付け（除算）がなされ（ステップ３３９）、判定対象訳語の分野判定度が計算され、判定対象訳語の分野判定度のリストを出力する（ステップ３４０）。原語訳を形態素解析して不要語を除くのは、コアワードと一致させるためである。 When retrieved from the core word dictionary 8, the field relevance of the core word is extracted by the translated word determination means 4 (step 337), and the number of appearances is weighted for the extracted field relevance according to the above equation (5). (Multiplication) is performed (step 338), the number of source language translations is weighted (division) (step 339), the field judgment degree of the judgment target translation is calculated, and a list of field judgment degrees of the judgment target translation is output. (Step 340). The reason why the original word translation is subjected to morphological analysis and unnecessary words are removed is to match the core word.
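
The translation-side calculation (steps 331 to 340) could be sketched like this. It is not the patented implementation: same-sentence co-occurrence, pre-tokenized input, single-token source language translations (so the morphological analysis of step 335 reduces to a stop-word filter), and summing contributions that reach the same core word are all assumptions, as are the names `translation_field_degrees`, `word_dict`, and the placeholder translations.

```python
from collections import Counter


def translation_field_degrees(target, sentences, word_dict, core_words,
                              stop_words=frozenset()):
    """Build the field determination degree list for a translation-side target word.

    word_dict:  {translated_word: [source language translations]} (assumed layout).
    core_words: {(category, core_word): field_relevance} (assumed layout).
    """
    # Steps 332-333: count words co-occurring with the target in the same sentence.
    counts = Counter()
    for tokens in sentences:
        if target in tokens:
            counts.update(t for t in tokens if t != target)

    # Index core words by surface form for the lookup of step 336.
    by_word = {}
    for (category, core), relevance in core_words.items():
        by_word.setdefault(core, []).append((category, relevance))

    degrees = {}
    for word, n in counts.items():
        translations = word_dict.get(word, [])  # step 334
        for tr in translations:
            if tr in stop_words:  # step 335, simplified
                continue
            for category, relevance in by_word.get(tr, []):
                key = (category, tr)
                # Steps 338-339: multiply by the occurrence count, then divide
                # by the number of source language translations.
                degrees[key] = degrees.get(key, 0.0) + relevance * n / len(translations)
    return degrees


# "election" co-occurs four times and has eight translations; only the first two
# are named in the description, the others are placeholders.
tr_sentences = [["judgement", "election"]] * 4 + [["weather", "report"]]
tr_word_dict = {"election": ["選挙", "投票", "t3", "t4", "t5", "t6", "t7", "t8"]}
tr_core_words = {("政治", "選挙"): 0.00962}
translation_list = translation_field_degrees(
    "judgement", tr_sentences, tr_word_dict, tr_core_words)
```

For this input the single entry is 0.00962 × 4 ÷ 8, the same product as in the worked example of section (A-2-8).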

（Ａ−２−７）判定対象語のカテゴリ毎の統合分野判定度の計算処理 (A-2-7) Calculation of the Integrated Field Determination Degree for Each Category of the Determination Target Word
図８は、判定対象語のカテゴリ毎の統合分野判定度を計算する処理動作を説明するフローチャートである。 FIG. 8 is a flowchart for explaining the processing operation for calculating the integrated field determination degree for each category of the determination target word.

判定対象原語の分野判定度のリストと判定対象訳語の分野判定度のリストとは、対訳判定手段５に入力し（ステップ３４１）、対訳判定手段５により統合されて、上記式（６）に従って、判定対象語の分野判定度が計算され、判定対象語の分野判定度のリストが作成される（ステップ３４２）。 The list of field determination degrees of the determination target source word and the list of field determination degrees of the determination target translated word are input to the bilingual determination means 5 (step 341), which integrates them, calculates the field determination degree of the determination target word according to equation (6) above, and creates the list of field determination degrees of the determination target word (step 342).

判定対象語の分野判定度のリストが作成されると、対訳判定手段５により、判定対象語の分野判定度のリストから、上記式（７）に従って、判定対象語のカテゴリ毎の統合分野判定度が計算され（ステップ３４３）、計算された統合分野判定度が、カテゴリ辞書１２に格納される（ステップ３４４）。 When the list of the field determination degrees of the determination target words is created, the bilingual determination unit 5 calculates the integrated field determination degree for each category of the determination target words from the list of the field determination degrees of the determination target words according to the above formula (7). Is calculated (step 343), and the calculated integrated field determination degree is stored in the category dictionary 12 (step 344).

統合分野判定度がカテゴリ辞書１２に格納されると、分野判定した判定対象語が属するカテゴリを出力する（ステップ３４５）。出力するカテゴリは、統合分野判定度が最大のカテゴリとする。 When the integrated field determination degree is stored in the category dictionary 12, the category to which the determination target word determined in the field belongs is output (step 345). The category to be output is the category with the highest integrated field determination.
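
Steps 341 to 345 can be put together in one short sketch. The passage does not say how a (category, core word) pair that appears in only one of the two lists is handled, so this sketch assumes such pairs contribute nothing. The function name `classify_pair` and the dictionary layouts are hypothetical; the sample values are a subset of those quoted in the worked example.

```python
def classify_pair(source_list, translation_list, alpha):
    """Integrate both lists and pick the category with the largest integrated degree.

    source_list, translation_list: {(category, core_word): degree} (assumed layout).
    Returns (best_category or None, {category: integrated_degree}).
    """
    m1 = max(source_list.values(), default=0.0)
    m2 = max(translation_list.values(), default=0.0)
    totals = {}
    if m1 > 0 and m2 > 0:
        # Step 342: only pairs present in both lists are combined (assumption).
        for key in source_list.keys() & translation_list.keys():
            a = source_list[key] / m1
            b = translation_list[key] / m2
            degree = 2 * a * b / (a + b) if a + b > 0 else 0.0  # equation (6)
            # Step 343: thresholded per-category sum, equation (7).
            if degree >= alpha:
                totals[key[0]] = totals.get(key[0], 0.0) + degree
    # Step 345: the output category is the one with the maximum integrated degree.
    best = max(totals, key=totals.get) if totals else None
    return best, totals


src = {("baseball", "baseball"): 2.581, ("politics", "election"): 1.875}
trg = {("politics", "election"): 0.00481, ("politics", "parliament"): 0.01076}
best_category, category_totals = classify_pair(src, trg, alpha=0.02)
print(best_category)  # prints politics
```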

（Ａ−２−８）実施例 (A-2-8) Example
次に、判定対象語を「審判」“Ｊｕｄｇｍｅｎｔ”とするときの分野判定について説明する。 Next, field determination will be described for the case where the determination target word is the pair “審判” and “judgement”.

なお、以下では、文書格納処理及びコアワード作成処理は、予めなされているものとする。 In the following, it is assumed that document storage processing and core word creation processing are performed in advance.

入力手段１に判定対象語「審判」“ｊｕｄｇｅｍｅｎｔ”を入力すると、原語判定手段３により、判定対象原語の分野判定度のリストが作成されて出力される。この判定対象原語の分野判定度のリストの作成処理の詳細については上述したので省略する。 When the determination target word “審判” / “judgement” is input to the input means 1, the source word determination means 3 creates and outputs the list of field determination degrees of the determination target source word. The details of creating this list were described above and are omitted here.

図１０は判定対象原語の分野判定度のリスト例である。図１０に示すように、判定対象原語の分野判定度のリストは、原語、カテゴリ、コアワード、分野判定度を項目例として有する。 FIG. 10 is a list example of the field determination degree of the determination target original language. As illustrated in FIG. 10, the list of field determination degrees of the determination target original word includes original words, categories, core words, and field determination degrees as item examples.

分野判定度について、例えば、判定対象原語「審判」に対して共起関係にある語「選挙」を例とし、図１０を参照して説明する。 The field determination degree will be described with reference to FIG. 10 using, for example, the word “election” having a co-occurrence relationship with the determination target original word “referee”.

例えば、判定対象原語「審判」に対して共起関係にある語「選挙」が１９５回出現し、カテゴリ「政治」コアワード「選挙」の分野関連度が０．００９６２であるとする場合、判定対象原語「審判」に対するカテゴリ「政治」コアワード「選挙」の分野判定度は、上記式（４）に従って、０．００９６２×１９５＝１．８７５となる。 For example, suppose that the word “election”, which co-occurs with the determination target source word “審判”, appears 195 times and that the field relevance degree of the core word “election” in the category “politics” is 0.00962. Then, according to equation (4) above, the field determination degree of the core word “election” in the category “politics” for the determination target source word “審判” is 0.00962 × 195 = 1.875.

また、訳語判定手段４により、判定対象訳語“Ｊｕｄｇｅｍｅｎｔ”と共起関係にある訳語を訳語文書から抽出する。また、抽出した共起関係にある語の原語訳を単語辞書１１から検索する。また、原語訳を形態素解析した結果から不要語を除いた単語を抽出し、抽出した語をコアワード辞書８から検索する。 Further, the translated word judging means 4 extracts the translated word co-occurring with the judgment target translated word “Judgment” from the translated word document. In addition, the word dictionary 11 is searched for a source language translation of the extracted words having a co-occurrence relationship. Further, a word excluding unnecessary words is extracted from the result of morphological analysis of the original word translation, and the extracted word is searched from the core word dictionary 8.

図１１は、判定対象訳語、その共起関係にある語、原語訳及び不要語を除いた語に基づくコアワードの関係を説明する説明図である。 FIG. 11 is an explanatory diagram for explaining a relationship between core words based on a word excluding a target translation word, a word having a co-occurrence relationship, an original language translation, and an unnecessary word.

さらに、訳語判定手段４により、検索したコアワードの分野関連度に対して、出現回数の重み付けがなされ、さらに原語訳の数の重み付けがなされ、分野判定度が計算され、判定対象訳語の分野判定度のリストが作成されて出力される。 Further, the translated word determining means 4 weights the number of appearances to the field relevance of the searched core word, further weights the number of source word translations, calculates the field determination, and determines the field determination of the translation target word List is created and output.

図１２は、判定対象訳語の分野判定度のリスト例である。図１２に示すように、判定対象訳語の分野判定度のリストは、訳語、カテゴリ、コアワード、分野判定度を項目例として有する。 FIG. 12 is a list example of the field determination degrees of the determination target translated words. As illustrated in FIG. 12, the list of field determination degrees of the determination target translated words includes translation words, categories, core words, and field determination degrees as item examples.

分野判定度について、例えば、判定対象訳語“ｊｕｄｇｅｍｅｎｔ”に対して共起関係にある語“ｅｌｅｃｔｉｏｎ”を例として説明する。 The field determination degree will be described using, as an example, the word “election”, which co-occurs with the determination target translated word “judgement”.

例えば、判定対象訳語“ｊｕｄｇｅｍｅｎｔ”に対して共起関係にある語“ｅｌｅｃｔｉｏｎ”が４回出現し、“ｅｌｅｃｔｉｏｎ”の原語訳が「選挙」「投票」などの８個であるとする。 For example, suppose that the word “election”, which co-occurs with the determination target translated word “judgement”, appears four times and that “election” has eight source language translations, such as “選挙” (election) and “投票” (voting).

この場合、判定対象訳語“ｊｕｄｇｅｍｅｎｔ”に対するカテゴリ「政治」コアワード「選挙」の分野判定度は、上記式（５）に従って、０．００９６２×４÷８＝０．００４８１となる。 In this case, according to equation (5) above, the field determination degree of the core word “election” in the category “politics” for the determination target translated word “judgement” is 0.00962 × 4 ÷ 8 = 0.00481.

次に、対訳判定手段５で、図１０の判定対象原語の分野判定度のリストと、図１２の判定対象訳語の分野判定度のリストとから、図１３に示すような判定対象語の分野判定度のリストを作成する。 Next, the bilingual determination means 5 creates the list of field determination degrees of the determination target word shown in FIG. 13 from the list of field determination degrees of the determination target source word in FIG. 10 and the list of field determination degrees of the determination target translated word in FIG. 12.

なお、図１３は、判定対象語の分野判定度のリスト例である。また、判定対象語の分野判定度のリストは、図１３に示すように、原語、訳語、カテゴリ、コアワード、分野判定度を項目例として有する。 FIG. 13 is an example of the list of field determination degrees of the determination target word. As shown in FIG. 13, this list has the source word, translated word, category, core word, and field determination degree as example items.

分野判定度について、判定対象語「審判」“ｊｕｄｇｅｍｅｎｔ”に対するカテゴリ「政治」コアワード「選挙」を例として、図１３を参照して説明する。 The field determination degree will be described with reference to FIG. 13, taking as an example the core word “election” in the category “politics” for the determination target word “審判” / “judgement”.

この場合、図１０に示すように、判定対象原語「審判」について、カテゴリ「政治」、コアワード「選挙」の分野判定度は１．８７５（＝Ｖ１）である。また、判定対象原語「審判」の分野判定度の最大値は、カテゴリ「野球」、コアワード「野球」の場合であり、２．５８１（＝Ｍ１）である。 In this case, as shown in FIG. 10, the field determination degree of the category “politics” and the core word “election” is 1.875 (= V1). Further, the maximum value of the field determination degree of the determination target original word “referee” is the case of the category “baseball” and the core word “baseball”, which is 2.581 (= M1).

また、図１２に示すように、判定対象訳語“ｊｕｄｇｅｍｅｎｔ”について、カテゴリ「政治」、コアワード「選挙」の分野判定度は０．００４８１（＝Ｖ２）である。また、判定対象訳語“ｊｕｄｇｅｍｅｎｔ”の分野判定度の最大値は、カテゴリ「政治」、コアワード「国会」の場合であり、０．０１０７６（＝Ｍ２）である。 Further, as shown in FIG. 12, for the determination target translation word “judgement”, the category determination degree of the category “politics” and the core word “election” is 0.00481 (= V2). In addition, the maximum value of the field determination degree of the determination target translated word “judgement” is in the case of the category “politics” and the core word “National Diet”, which is 0.01076 (= M2).

従って、判定対象語「審判」“ｊｕｄｇｅｍｅｎｔ”に対するカテゴリ「政治」コアワード「選挙」の分野判定度は、上記式（６）に従って、２×（１．８７５÷２．５８１）×（０．００４８１÷０．０１０７６）÷（（１．８７５÷２．５８１）＋（０．００４８１÷０．０１０７６））＝０．５５３となる。 Therefore, the field determination degree of the category “politics” core word “election” for the determination target word “judgment” “judgment” is 2 × (1.875 ÷ 2.581) × (0.00481 ÷) according to the above equation (6). 0.01076) ÷ ((1.875 ÷ 2.581) + (0.00481 ÷ 0.01076)) = 0.553.

次に、対訳判定手段５により、上記式（７）に従って、カテゴリ毎の統合分野判定度が計算される。 Next, the parallel field determination means 5 calculates the integrated field determination degree for each category according to the above equation (7).

統合分野判定度について、判定対象語「審判」“ｊｕｄｇｅｍｅｎｔ”に対するカテゴリ「政治」を例として図１３を参照して説明する。 The integrated field determination degree will be described with reference to FIG. 13 by taking the category “politics” for the determination target word “judgment” “judgment” as an example.

例えば、閾値αを０．０２とする場合、０．０２以上の分野判定度がカテゴリ毎に累積され、例えば、判定対象語「審判」“ｊｕｄｇｅｍｅｎｔ” に対するカテゴリ「政治」の統合分野判定度は、０．５５３＋０．１７０＋０．１２３＋…＋０．０２１＝１．１３８となる。 For example, when the threshold value α is set to 0.02, the field determination degree of 0.02 or more is accumulated for each category. For example, the integrated field determination degree of the category “politics” with respect to the determination target word “judgment” is 0.553 + 0.170 + 0.123 + ... + 0.021 = 1.138

以上のようにして、カテゴリ毎の統合分野判定度が計算されて、カテゴリ辞書１２に格納される。図１４は、カテゴリ辞書１２が格納する情報例を示すものである。図１４に示すように、カテゴリ辞書１２は、原語、訳語、カテゴリ、統合分野判定度を有する。 As described above, the integrated field determination degree for each category is calculated and stored in the category dictionary 12. FIG. 14 shows an example of information stored in the category dictionary 12. As shown in FIG. 14, the category dictionary 12 has original words, translated words, categories, and integrated field determination degrees.

そして、全てのカテゴリについて計算されると、対訳判定手段５により、統合分野判定度が最大となるカテゴリ「政治」が選出されて出力される。 When all the categories are calculated, the bilingual determination unit 5 selects and outputs the category “politics” having the maximum integrated field determination degree.

（Ａ−３）第１の実施形態の効果 (A-3) Effects of the First Embodiment
以上、本実施形態によれば、分野に分類済の原語文書から前もってコアワードを作成しておくことによって、原語と訳語のそれぞれに含まれる多義や原語と訳語の間の多義を解消して、対訳の語や句を分類することができる。その際、原語側のコアワードさえあれば、訳語側から原語側への単語辞書と訳語側の文書を変更するだけで、多言語に対応することができる。このとき、単語辞書は単語の訳さえあればよいので、特別に作成する必要はないので記憶資源の節約にも寄与する。また、分野別辞書の内容や量に依存することなく分類することができる。 As described above, according to this embodiment, core words are created in advance from source language documents that have already been classified into fields. This makes it possible to resolve the polysemy within the source words and within the translated words, as well as the polysemy between source words and translated words, and thus to classify bilingual words and phrases. Moreover, as long as the core words on the source language side exist, other languages can be supported simply by replacing the word dictionary from the translated language to the source language and the documents on the translated language side. Because the word dictionary only needs to contain translations of words, no special dictionary has to be created, which also helps save storage resources. In addition, classification can be performed without depending on the content or size of field-specific dictionaries.

（Ｂ）第２の実施形態 (B) Second Embodiment
次に、本発明の対訳語句分類システム、対訳語句分類方法及び対訳語句分類プログラムの第２の実施形態について図面を参照して説明する。 Next, a second embodiment of the bilingual phrase classification system, bilingual phrase classification method, and bilingual phrase classification program of the present invention will be described with reference to the drawings.

第２の実施形態も、第１の実施形態で説明した用語の定義を援用する。 The second embodiment also uses the definitions of terms described in the first embodiment.

（Ｂ−１）第２の実施形態の構成 (B-1) Configuration of Second Embodiment
図１５は、本実施形態の対訳語句分類システムの機能的構成を示すブロック図である。第２の実施形態の対訳語句分類システム２００も、第１の実施形態と同様に、例えば、入出力手段を備えるパソコン等の情報処理装置上に、対訳語句分類プログラムをインストールすること等によって実現されるが、機能的には図１５に表わすことができる。 FIG. 15 is a block diagram showing the functional configuration of the bilingual phrase classification system of this embodiment. Like the first embodiment, the bilingual phrase classification system 200 of the second embodiment is realized, for example, by installing a bilingual phrase classification program on an information processing apparatus such as a personal computer equipped with input/output means. Functionally, it can be represented as shown in FIG. 15.

図１５において、第２の実施形態の対訳語句分類システム２００は、入力手段１、コアワード作成手段２、原語判定手段３、訳語判定手段４、対訳判定手段５、出力手段６、原語分類済文書７、コアワード辞書８、原語未分類文書９、訳語未分類文書１０、単語辞書１１、カテゴリ辞書１２、カテゴリ関係辞書１３を有する。 In FIG. 15, the bilingual phrase classification system 200 of the second embodiment includes an input means 1, a core word creation means 2, a source word determination means 3, a translated word determination means 4, a bilingual determination means 5, an output means 6, source language classified documents 7, a core word dictionary 8, source language unclassified documents 9, translated language unclassified documents 10, a word dictionary 11, a category dictionary 12, and a category relation dictionary 13.

なお、図１５において、図１で説明した同一・対応する機能構成については、説明の便宜上、対応する符号を付して説明する。 In FIG. 15, the same and corresponding functional configurations described in FIG. 1 will be described with corresponding reference numerals for convenience of description.

第２の実施形態の対訳語句分類システム２００は、カテゴリ関係辞書１３を備える点が、第１の実施形態の機能構成と異なる。また、カテゴリ関係辞書１３を備えることにより、コアワード作成手段２による文書格納処理、原語判定手段３によるコアワード作成処理の一部動作が異なる。従って、以下では、第１の実施形態と異なる機能を中心に説明する。 The bilingual phrase classification system 200 of the second embodiment is different from the functional configuration of the first embodiment in that it includes a category relation dictionary 13. Further, by providing the category relation dictionary 13, part of operations of the document storing process by the core word creating unit 2 and the core word creating process by the original word determining unit 3 are different. Accordingly, the following description focuses on functions different from those of the first embodiment.

第２の実施形態では、階層構造のカテゴリに分類することを意識する。つまり、複数の文書からなる文書集合について、その文書集合を構成する複数の文書のカテゴリが階層構造をなしている場合を想定し、その階層カテゴリに分類する。 In the second embodiment, attention is paid to classification into hierarchical categories. That is, regarding a document set made up of a plurality of documents, assuming that the categories of the plurality of documents constituting the document set have a hierarchical structure, they are classified into the hierarchical categories.

ここで、階層カテゴリについて図１６を参照して説明する。図１６は、本実施形態が想定する階層カテゴリを説明するイメージ図である。 Here, the hierarchical categories will be described with reference to FIG. 16. FIG. 16 is a conceptual diagram illustrating the hierarchical categories assumed in the present embodiment.

図１６に示すように、「ＴＯＰ」は最上層カテゴリであり、その下の階層には「スポーツ」、「コンピュータ」等のカテゴリが存在する。さらに「スポーツ」の下の階層には、「野球」、「サッカー」等のカテゴリが存在し、また「コンピュータ」の下の階層には、「ＯＳ」、「プログラミング」等のカテゴリが存在する。さらに、「ＯＳ」の下の階層には「ＯＳ１」、「ＯＳ２」等のカテゴリが存在する。 As shown in FIG. 16, “TOP” is the uppermost category, and categories such as “Sports” and “Computer” exist in the lower layer. Furthermore, categories such as “baseball” and “soccer” exist in the hierarchy below “sports”, and categories such as “OS” and “programming” exist in the hierarchy below “computer”. Furthermore, categories such as “OS1” and “OS2” exist in a hierarchy below “OS”.

これらカテゴリの間にはカテゴリ間の包含、被包含の関係に応じた親子関係が存在し、「ＴＯＰ」カテゴリの子にあたるのは「スポーツ」、「コンピュータ」等のカテゴリである。同様にして、「スポーツ」カテゴリの子にあたるのは「野球」、「サッカー」等のカテゴリであり、「コンピュータ」カテゴリの子にあたるのは「ＯＳ」、「プログラミング」等のカテゴリである。さらに、「ＯＳ」カテゴリの子にあたるのは「ＯＳ１」、「ＯＳ２」等のカテゴリである。 Between these categories, there is a parent-child relationship according to the inclusion and inclusion relationships between categories, and categories such as “sports” and “computer” are children of the “TOP” category. Similarly, categories such as “baseball” and “soccer” are children of the “sports” category, and categories such as “OS” and “programming” are children of the “computer” category. Furthermore, the categories such as “OS1” and “OS2” are children of the “OS” category.

このうち、「ＴＯＰ」カテゴリは最上層カテゴリであるから親を持たず、反対に最下層カテゴリである「野球」、「サッカー」、「プログラミング」、「ＯＳ１」等は子を持たない。なお、図１６では、黒丸で示すカテゴリを最下層カテゴリとする。 Among these, since the “TOP” category is the uppermost category, it has no parent, and on the contrary, the lowermost categories “baseball”, “soccer”, “programming”, “OS1”, etc. have no children. In FIG. 16, the category indicated by a black circle is the lowest layer category.

カテゴリ関係辞書１３は、カテゴリ間の階層関係の情報を格納する辞書である。例えば、図１６に示すような階層カテゴリの構成を人間が定義した際に、当該カテゴリ関係辞書１３を作成しておく。カテゴリ関係辞書１３の内容は、そのカテゴリを一意に指定するカテゴリ名と、その親カテゴリに関する情報とから構成される。 The category relationship dictionary 13 is a dictionary that stores information on hierarchical relationships between categories. For example, when a person defines a hierarchical category configuration as shown in FIG. 16, the category relation dictionary 13 is created. The contents of the category relation dictionary 13 are composed of a category name that uniquely designates the category and information related to the parent category.

図１７は、カテゴリ関係辞書１３の構成及び内容例を示すものであり、図１６の階層カテゴリを前提としたものである。図１７において、「−」は空値（すなわち無いことを示す）である。 FIG. 17 shows an example of the configuration and contents of the category relation dictionary 13, based on the hierarchical categories of FIG. 16. In FIG. 17, “−” is a null value (that is, it indicates that there is no parent category).
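As an illustration of the structure just described, the category relation dictionary 13 can be pictured as a mapping from each category name to its parent category. The following is a minimal sketch only: the Python representation, the names `CATEGORY_RELATIONS`, `children_of`, and `path_to_top`, and the exact set of entries are assumptions made for this example (the entries follow the categories named for FIG. 16; the actual contents of FIG. 17 are not reproduced here).

```python
# Hypothetical in-memory form of the category relation dictionary 13.
# Key: category name (unique). Value: parent category name, or None where
# FIG. 17 would show the null value "-" (no parent).
CATEGORY_RELATIONS = {
    "TOP": None,
    "Sports": "TOP",
    "Computer": "TOP",
    "Baseball": "Sports",
    "Soccer": "Sports",
    "OS": "Computer",
    "Programming": "Computer",
    "OS1": "OS",
    "OS2": "OS",
}


def children_of(relations, parent):
    """Return the categories whose parent entry is `parent`."""
    return [name for name, p in relations.items() if p == parent]


def path_to_top(relations, name):
    """Follow parent links from `name` up to the category that has no parent."""
    path = [name]
    while relations[path[-1]] is not None:
        path.append(relations[path[-1]])
    return path
```

Storing only the parent of each category is enough to recover both directions of the hierarchy: children are found by scanning for entries that name a given parent, which is the kind of search the later steps perform against this dictionary.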

コアワード作成手段２は、原語分類済文書７を利用して、コアワードと分野関連度とを作成するものである。コアワード作成手段２は、まず、原語分類済文書７の最下層のカテゴリについてコアワードを作成し、その後、中間層のカテゴリについて、最下層のカテゴリのコアワードを利用してコアワードを作成する。 The core word creation means 2 creates the core word and the field relevance using the source language classified document 7. The core word creating means 2 first creates a core word for the lowermost category of the source language classified document 7, and then creates a core word for the middle layer category using the core word of the lowermost category.

このように、コアワード作成手段２が、最下層のカテゴリのコアワード作成処理と、中間層のカテゴリのコアワード作成処理とを異なる処理で行なうのは以下の理由からである。 As described above, the core word creation means 2 performs the core word creation processing of the lowermost category and the core word creation processing of the middle category by different processes for the following reasons.

階層構造のカテゴリに分類するために、最下層以外の途中の階層も含む全てのカテゴリにコアワードを付与することが必要である。カテゴリに属する文書を与える方法では、子カテゴリの内容を含まない「その他」にあたる文書の場合はカテゴリに属する文書として問題ないが、子カテゴリの内容を含む「全般」にあたる文書の場合は親と子の区別が曖昧になるので、正しくかつ出来る限り狭い対象範囲に分類することができないことがある。 To classify into hierarchical categories, core words must be assigned to every category, including the intermediate layers above the lowest layer. If core words were derived from documents supplied for each category, a document of the “Other” kind, which does not include the contents of any child category, would pose no problem as a document belonging to the parent category. A document of the “General” kind, which does include the contents of child categories, would blur the distinction between parent and child, so a word might not be classified correctly into the narrowest possible scope.

しかし、途中の階層も含む全てのカテゴリに適切に分類された文書を用意することは、階層が深くなるほど労力を要するので、親カテゴリのコアワードは文書からは作成しない。 However, preparing documents appropriately classified into all categories including intermediate levels requires more labor as the levels become deeper, so the core word of the parent category is not created from the documents.

文書は、最下層のカテゴリの単位に分類された文書だけを用意し、最下層のカテゴリでは文書からコアワードを作成し、親カテゴリのコアワードは直下の子カテゴリのコアワードから作成することにする。 As the document, only a document classified in the unit of the lowest category is prepared. In the lowest category, a core word is created from the document, and the core word of the parent category is created from the core word of the child category immediately below.

（Ｂ−２）第２の実施形態の動作 (B-2) Operation of the Second Embodiment
次に、本実施形態の対訳語句分類システム２００の動作について説明する。 Next, the operation of the bilingual phrase classification system 200 of this embodiment will be described.

本実施形態の対訳語句分類システム２００が、分野判定を行なうためには、最下層のカテゴリに分類された原語文書を、事前にカテゴリ別の原語分類済文書７に格納し、その原語分類済文書７を利用して、コアワードと分野関連度を作成しておく必要がある。 For the bilingual phrase classification system 200 of this embodiment to perform field determination, source-language documents classified into the lowest-layer categories must be stored in advance, category by category, in the source language classified document 7, and the core words and their field relevance must be created beforehand from that source language classified document 7.

なお、最下層カテゴリのコアワードは、コアワード作成手段２が原語分類済文書７の文書から作成し、中間層カテゴリのコアワードは、コアワード作成手段２がカテゴリ関係辞書１３を利用して最下層カテゴリのコアワードから作成する。 The core words of the lowest-layer categories are created by the core word creating means 2 from the documents in the source language classified document 7, and the core words of the intermediate-layer categories are created by the core word creating means 2 from the core words of the lowest-layer categories, using the category relation dictionary 13.

その後、各種データベース及び各種辞書を利用して分野判定を行なう。 Then, field judgment is performed using various databases and various dictionaries.

以下では、文書格納処理、コアワード作成処理の動作について説明する。なお、対訳分類システム２００の全体動作及び対訳分野判定処理の動作は、第１の実施形態で説明した動作に対応するので詳細な説明は省略する。 Hereinafter, operations of the document storage process and the core word creation process will be described. Note that the overall operation of the bilingual classification system 200 and the operation of the bilingual field determination process correspond to the operation described in the first embodiment, and thus detailed description thereof is omitted.

（Ｂ−２−１）文書格納処理 (B-2-1) Document Storage Processing
図１８は、文書格納処理の動作を説明するフローチャートである。 FIG. 18 is a flowchart for explaining the operation of the document storage processing.

図２において、文書格納処理が選択されると（ステップ２０１及び２０２）、図１８に示すように、ユーザによる処理の選択を受け付ける（ステップ６０１）。 In FIG. 2, when the document storage process is selected (steps 201 and 202), the selection of the process by the user is accepted as shown in FIG. 18 (step 601).

ユーザによる処理選択を入力手段１は取り込み、その選択に応じて、最下層のカテゴリを指定して最下層のカテゴリ別に分類済の原語文書を入力して原語分類済文書７にカテゴリ別に格納するか（ステップ６０２）、未分類の原語文書を入力して原語未分類文書９に格納するか（ステップ６０３）、未分類の訳語文書を入力して訳語未分類文書１０に格納するか（ステップ６０４）、又は終了（ＥＮＤ）するかを実行する。 The input means 1 takes in the user's selection and, according to it, does one of the following: accepts source-language documents already classified by lowest-layer category, with the lowest-layer category specified, and stores them by category in the source language classified document 7 (step 602); accepts unclassified source-language documents and stores them in the source language unclassified document 9 (step 603); accepts unclassified translation-language documents and stores them in the translated word unclassified document 10 (step 604); or ends the processing (END).

（Ｂ−２−２）コアワード作成処理 (B-2-2) Core Word Creation Processing
図１９は、コアワード作成処理の動作を説明するフローチャートである。 FIG. 19 is a flowchart for explaining the operation of the core word creation processing.

図２において、コアワード作成処理が選択されると（ステップ２０１及び２０３）、図１９の処理を実行する。 In FIG. 2, when the core word creation process is selected (steps 201 and 203), the process of FIG. 19 is executed.

図１９において、コアワード作成手段２により、文書格納処理で最下層のカテゴリ別に格納された原語分類済文書の文書を用いて、最下層のカテゴリのコアワードが作成される（ステップ７０１）。なお、最下層のカテゴリのコアワードの作成処理についての詳細は図２０を用いて後述する。 In FIG. 19, the core word creating means 2 creates the core words of the lowest-layer categories from the documents of the source language classified document that were stored by lowest-layer category in the document storage processing (step 701). Details of the core word creation processing for the lowest-layer categories will be described later with reference to FIG. 20.

最下層のカテゴリのコアワードが作成されると、コアワード作成手段２により、その作成された最下層のカテゴリのコアワードを用いて中間層のカテゴリのコアワードが作成される（ステップ７０２）。なお、中間層のカテゴリのコアワードの作成処理についての詳細は図２１を用いて後述する。 Once the core words of the lowest-layer categories have been created, the core word creating means 2 uses them to create the core words of the intermediate-layer categories (step 702). Details of the core word creation processing for the intermediate-layer categories will be described later with reference to FIG. 21.

（Ｂ−２−２−１）最下層のカテゴリのコアワード作成処理 (B-2-2-1) Core Word Creation Processing for Lowest-Layer Categories
図２０は、最下層のカテゴリのコアワード作成処理の動作を説明するフローチャートである。 FIG. 20 is a flowchart for explaining the operation of the core word creation processing for the lowest-layer categories.

図２０において、コアワード作成手段２により、カテゴリ関係辞書１３に格納されているカテゴリのうち、最下層のカテゴリが取り出される（ステップ７１１）。 In FIG. 20, the core word creating means 2 takes out the lowest category among the categories stored in the category relation dictionary 13 (step 711).

この最下層のカテゴリとは、子カテゴリを持たないカテゴリであり、例えば、図１７に示すカテゴリ関係辞書１３の例において、カテゴリ名が親カテゴリの項目として上げられていないカテゴリである。例えば、図１７において、「野球」は親カテゴリに上げられていないから、最下層のカテゴリである。 A lowest-layer category is a category that has no child category; in the example of the category relation dictionary 13 shown in FIG. 17, it is a category whose name does not appear in the parent category column of any entry. For example, in FIG. 17, “baseball” is not listed as the parent category of any entry, so it is a lowest-layer category.
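The test described above (a category is in the lowest layer when no entry names it as a parent) can be sketched as follows. The dictionary layout, the sample entries, and the function name `lowest_categories` are assumptions for illustration, not part of the described system.

```python
# Hypothetical category relation dictionary: category name -> parent name
# (None stands for the null value "-").
relations = {
    "TOP": None,
    "Sports": "TOP",
    "Computer": "TOP",
    "Baseball": "Sports",
    "Soccer": "Sports",
    "OS": "Computer",
    "Programming": "Computer",
    "OS1": "OS",
    "OS2": "OS",
}


def lowest_categories(relations):
    """Return the categories that never appear as the parent of another entry."""
    parents = {p for p in relations.values() if p is not None}
    return [name for name in relations if name not in parents]


print(lowest_categories(relations))
# prints ['Baseball', 'Soccer', 'Programming', 'OS1', 'OS2']
```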

最下層のカテゴリが取り出されると、その取り出されたすべてのカテゴリに対して、以下で説明するコアワードの作成処理が行なわれているか確認し、すべてのカテゴリに対してコアワード作成処理が終了するまでコアワード作成処理が行なわれる（ステップ７１２）。 When the lowest-layer categories have been extracted, it is checked whether the core word creation processing described below has been performed for all of the extracted categories, and the processing is repeated until it has been completed for every category (step 712).

ステップ７１１において、カテゴリ関係辞書１３から最下層のカテゴリが取り出されると、コアワード作成手段２により、その取り出されたカテゴリに分類されている文書が、カテゴリ別に、原語分類済文書７から取り出される（ステップ７１３）。 When the lowest-layer categories have been extracted from the category relation dictionary 13 in step 711, the core word creating means 2 retrieves, category by category, the documents classified into each extracted category from the source language classified document 7 (step 713).

最下層のカテゴリの文書が原語分類済文書７から取り出されると、コアワード作成手段２により、その取り出された文書は、形態素解析がなされる（ステップ７１４）。 When the documents of a lowest-layer category have been retrieved from the source language classified document 7, the core word creating means 2 performs morphological analysis on the retrieved documents (step 714).

コアワード作成手段２により最下層のカテゴリの文書が形態素解析されると、コアワード作成手段２により、その形態素解析結果から不要語を除いた単語がコアワードとして抽出される（ステップ７１５）。 When the core word creation unit 2 performs morphological analysis on the lowermost category document, the core word creation unit 2 extracts a word obtained by removing unnecessary words from the morpheme analysis result as a core word (step 715).

そして、最下層のカテゴリに対してコアワードが抽出されると、コアワード作成手段２により、第１の実施形態で説明したものと同様に、上記式（３）に従って、分野関連度が計算される（ステップ７１６）。 Then, when the core words have been extracted for a lowest-layer category, the core word creating means 2 calculates the field relevance according to the above equation (3), in the same way as described in the first embodiment (step 716).

このようにして、コアワード作成手段２により、最下層のカテゴリのコアワードが作成され、その作成されたコアワードの分野関連度が計算されると、そのコアワードと分野関連度とは、コアワード辞書８に格納される（ステップ７１７）。 Thus, when the core word of the lowest category is created by the core word creating means 2 and the field relevance of the created core word is calculated, the core word and the field relevance are stored in the core word dictionary 8. (Step 717).

上述したように、最下層のカテゴリのコアワード作成処理は、Ｓ７１１で取り出した全ての最下層のカテゴリに対してなされる。 As described above, the core word creation process of the lowest category is performed for all the lowest category extracted in S711.
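Steps 713 to 717 can be pictured roughly as below. This is a sketch under several assumptions, not the method of the embodiment: whitespace splitting stands in for morphological analysis, `STOP_WORDS` is an invented list of unnecessary words, and, because equation (3) is not reproduced in this section, a generic smoothed tf*idf divided by the number of core word occurrences in the category is used as a stand-in weighting. All function and variable names are hypothetical.

```python
import math
from collections import Counter

# Invented list of unnecessary words; a real system would define its own.
STOP_WORDS = {"the", "a", "of", "in", "and", "to"}


def tokenize(text):
    """Stand-in for morphological analysis: lowercase whitespace split."""
    return text.lower().split()


def build_core_word_dictionary(classified_docs):
    """classified_docs: {lowest-layer category: [document text, ...]}.

    Returns {category: {core word: field relevance}}.
    The weighting is an assumed stand-in for equation (3).
    """
    # Steps 713-715: gather documents per category, analyse them,
    # and keep every word that is not an unnecessary word.
    counts = {}
    for category, docs in classified_docs.items():
        words = [w for doc in docs for w in tokenize(doc) if w not in STOP_WORDS]
        counts[category] = Counter(words)

    n_categories = len(counts)
    # Number of categories in which each word occurs at least once.
    category_freq = Counter(w for counter in counts.values() for w in counter)

    # Steps 716-717: compute the relevance and store it per category.
    dictionary = {}
    for category, counter in counts.items():
        total = sum(counter.values())  # assumed normaliser: core word occurrences
        dictionary[category] = {
            word: tf * math.log(1 + n_categories / category_freq[word]) / total
            for word, tf in counter.items()
        }
    return dictionary
```

Dividing by the per-category total keeps the values comparable between categories that were given different amounts of text, which is the concern the description raises about using tf*idf unchanged.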

（Ｂ−２−２−２）中間層のカテゴリのコアワード作成処理 (B-2-2-2) Core Word Creation Processing for Intermediate-Layer Categories
図２１は、中間層のカテゴリのコアワード作成処理の動作を説明するフローチャートである。 FIG. 21 is a flowchart for explaining the operation of the core word creation processing for the intermediate-layer categories.

図２１において、コアワード作成手段２により、カテゴリ関係辞書１３に格納されているカテゴリのうち、最上層のカテゴリを取り出す（ステップ７２１）。 In FIG. 21, the core word creating means 2 extracts the top-layer category from among the categories stored in the category relation dictionary 13 (step 721).

この最上層のカテゴリは、例えば、図１７に示すカテゴリ関係辞書１３の例において、親カテゴリを持たないもの（親カテゴリが「−」のもの）である。 The uppermost category is, for example, one that does not have a parent category (has a parent category of “−”) in the example of the category relation dictionary 13 shown in FIG.

次に、カテゴリ関係辞書１３から最上層のカテゴリが取り出されると、コアワード作成手段２により、その最上層のカテゴリを親カテゴリにもつカテゴリが、カテゴリ関係辞書１３から検索される（ステップ７２２）。 Next, when the top-layer category has been extracted from the category relation dictionary 13, the core word creating means 2 searches the category relation dictionary 13 for the categories that have the top-layer category as their parent category (step 722).

Ｓ７２２において検索された全てのカテゴリに対して、以下で説明するコアワード作成処理が行なわれているか否かを確認し、すべてのカテゴリに対してコアワード作成処理が終了するまでコアワード作成処理が行なわれる（ステップ７２３）。 It is checked whether the core word creation processing described below has been performed for all of the categories found in S722, and the processing is repeated until it has been completed for every category (step 723).

Ｓ７２３においてカテゴリ関係辞書１３から検索されたカテゴリについて、コアワードが作成されていない場合、そのコアワードが作成されていないカテゴリを親カテゴリに持つカテゴリが、コアワード作成手段２により、カテゴリ関係辞書１３から検索される（ステップ７２４）。 If core words have not yet been created for a category found in the category relation dictionary 13 in S723, the core word creating means 2 searches the category relation dictionary 13 for the categories that have that category as their parent category (step 724).

Ｓ７２４においてカテゴリ関係辞書１３から検索されたカテゴリについて、同じカテゴリ（親カテゴリ）を親に持つすべての子カテゴリに対して、コアワードが作成済であるか否かを確認する（ステップ７２５）。 For the categories found in the category relation dictionary 13 in S724, it is checked whether core words have already been created for all child categories that share the same parent category (step 725).

Ｓ７２５において、すべての子カテゴリに対してコアワードが作成されているとコアワード作成手段２が確認したときＳ７２６に進み、一方すべての子カテゴリに対してコアワードが作成されていないとコアワード作成手段２が確認したとき、Ｓ７２４に戻り動作が繰り返される（ステップ７２５）。 In S725, when the core word creating means 2 confirms that core words have been created for all child categories, the processing proceeds to S726; when it finds that core words have not yet been created for every child category, the processing returns to S724 and is repeated (step 725).

これにより、親カテゴリは、すべての子カテゴリに対して作成されたコアワードをすべて持つこととなる。 As a result, the parent category has all the core words created for all the child categories.

Ｓ７２６では、コアワード作成手段２により、コアワードが作成され、その作成されたコアワードの分野関連度が計算される（ステップ７２６）。 In S726, the core word creating means 2 creates the core words and calculates the field relevance of each created core word (step 726).

ここで、コアワード作成手段２で最下層のカテゴリのコアワードを利用して親カテゴリのコアワードを作成する方法は、子カテゴリのコアワードの偏り具合から判断するというものである。 Here, the method of creating the core word of the parent category using the core word of the lowest category in the core word creating means 2 is to judge from the degree of deviation of the core words of the child category.

子カテゴリに万遍なく存在する場合には親カテゴリのコアワードにし（親カテゴリにも作成する）、いずれかの子カテゴリに突出している場合には親カテゴリのコアワードにしない（子カテゴリにのみ存在する）、という考え方である。 The idea is that a word spread evenly across the child categories is made a core word of the parent category (it is created for the parent category as well), whereas a word that stands out in one particular child category is not made a core word of the parent category (it remains only in the child categories).

手順は、以下の通りである。 The procedure is as follows.

ある親カテゴリの直下の子カテゴリ全体で、コアワード毎に、分野関連度の平均値（ｍｅａｎ）と標準偏差（ｓｄ）を、正規分布と仮定して、計算する。 For each core word, the mean (mean) and standard deviation (sd) of its field relevance are calculated over all child categories directly under a given parent category, assuming a normal distribution.

コアワードがないカテゴリの分野関連度は０として計算する。すべての分野関連度がばらつきの範囲内であれば、そのコアワードを親カテゴリのコアワードにする。 The field relevance of the category having no core word is calculated as 0. If all the field relevance levels are within the range of variation, the core word is set as the core word of the parent category.

例では、範囲を「平均値＋標準偏差×３」（ｍｅａｎ＋３ｓｄ）とする。「平均値＋標準偏差×３」（ｍｅａｎ＋３ｓｄ）の範囲内にデータが入る確率は９９．７３％である。 In this example, the range is “mean + 3 × standard deviation” (mean + 3sd). Under a normal distribution, the probability that a value falls within three standard deviations of the mean is 99.73%.

範囲の計算は正しくは平均値±プラスマイナスであるが、分野関連度に負の値はないため、マイナスの方は無視してよい。 Strictly speaking, the range is the mean plus or minus three standard deviations, but because field relevance is never negative, the lower bound can be ignored.

最下層のカテゴリの分野関連度は、ｔｆ＊ｉｄｆをそのまま利用するとコアワード作成文書の量に左右されるので、カテゴリ毎にコアワードの総数で割った値にしているが、親カテゴリのコアワードにはコアワード作成文書が存在しないので、親カテゴリの分野関連度は、適当に設定しなければならない。 If tf * idf is used as it is, the field relevance of the lowest category depends on the amount of core word creation documents. Therefore, the value is divided by the total number of core words for each category. Since there is no created document, the field relevance level of the parent category must be set appropriately.

ここでは、範囲の上限値「平均値＋標準偏差×３」（ｍｅａｎ＋３ｓｄ）とする。親カテゴリのコアワードの分野関連度はどの子カテゴリの分野関連度よりも高い値にするためである。 Here, it is set to the upper limit of the range, “mean + 3 × standard deviation” (mean + 3sd), so that the field relevance of a core word in the parent category is higher than its field relevance in any child category.
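The promotion rule described in this procedure can be sketched as a single function. The name `promote_to_parent`, the parameter `k`, and the choice of the population standard deviation (`statistics.pstdev`) are assumptions; the description does not state which form of standard deviation is used.

```python
from statistics import mean, pstdev


def promote_to_parent(child_relevances, k=3.0):
    """Decide whether a core word becomes a core word of the parent category.

    child_relevances: the word's field relevance in each direct child
    category, with 0.0 for child categories that lack the word.
    Returns the parent's field relevance (mean + k * sd) when the largest
    child value lies below that upper limit, otherwise None.
    """
    m = mean(child_relevances)
    sd = pstdev(child_relevances)  # assumption: population standard deviation
    upper_limit = m + k * sd
    if max(child_relevances) < upper_limit:
        return upper_limit
    return None


# A word with similar relevance in every child category is promoted.
print(promote_to_parent([0.001, 0.002, 0.0015]) is not None)   # prints True
# A word found in only one of twelve child categories is not.
print(promote_to_parent([0.00929] + [0.0] * 11))               # prints None
```

Because promotion requires the maximum to be strictly below the upper limit, the value returned for the parent always exceeds every child value, which matches the stated aim. One side effect of the strict comparison in this sketch is that a word with exactly the same relevance in every child category (standard deviation 0) is not promoted.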

Ｓ７２６において作成されたコアワード及びその分野関連度が計算されると、コアワード作成手段２により、コアワード及びその分野関連度がコアワード辞書８に格納される（ステップ７２７）。 When the core word created in S726 and its field relevance are calculated, the core word creation means 2 stores the core word and its field relevance in the core word dictionary 8 (step 727).

そして、Ｓ７２２に戻り、最上層のカテゴリを親カテゴリに持つすべてのカテゴリのコアワードが作成済になるまで動作が繰り返される。 Then, the process returns to S722, and the operation is repeated until the core words of all categories having the uppermost category as a parent category have been created.
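The net effect of steps 721 to 727 is that each intermediate-layer category is processed only after all of its child categories have core words. The sketch below reaches the same ordering with a depth-first, post-order walk, which is a substitute for the repeated search loop of the flowchart rather than a transcription of it. The dictionary layout and the name `build_order` are assumptions for illustration.

```python
def build_order(relations, top):
    """Return the intermediate-layer categories below `top`, ordered so that
    each category comes after all of its own intermediate-layer children.

    relations: {category name: parent name or None}
    """
    # Invert the parent links into a parent -> children table.
    children = {}
    for name, parent in relations.items():
        children.setdefault(parent, []).append(name)

    order = []

    def visit(category):
        for child in children.get(category, []):
            if child in children:   # the child has children of its own
                visit(child)        # finish everything below it first
                order.append(child)

    visit(top)
    return order


relations = {
    "TOP": None,
    "Sports": "TOP",
    "Computer": "TOP",
    "Baseball": "Sports",
    "Soccer": "Sports",
    "OS": "Computer",
    "Programming": "Computer",
    "OS1": "OS",
    "OS2": "OS",
}
print(build_order(relations, "TOP"))
# prints ['Sports', 'OS', 'Computer']
```

The top-layer category itself is left out of the result, since the loop described above ends once every category directly under the top-layer category has its core words.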

（Ｂ−２−３）コアワード作成処理の実施例 (B-2-3) Example of Core Word Creation Processing
次に、図１６の階層カテゴリにおいて、カテゴリ関係辞書１３が図１７に示す辞書情報を有する場合のコアワード作成処理の実施例について説明する。 Next, an example of the core word creation processing will be described for the hierarchical categories of FIG. 16, in the case where the category relation dictionary 13 holds the dictionary information shown in FIG. 17.

図２において、コアワード作成処理が選択されると、まず最下層のカテゴリのコアワード作成処理がなされる（ステップ７０１）。 In FIG. 2, when the core word creation process is selected, first the core word creation process of the lowest category is performed (step 701).

ここでは、例えば図１７における、「野球」、「サッカー」、「プログラミング」、「ＯＳ１」、「ＯＳ２」等の最下層のカテゴリのコアワードが作成され、その作成されたコアワードの分野関連度が計算される。そして、その計算されたコアワード及び分野関連度が、コアワード辞書８に格納される。 Here, for example, the core words of the lowest category such as “baseball”, “soccer”, “programming”, “OS1”, “OS2” in FIG. 17 are created, and the field relevance of the created core words is calculated. Is done. Then, the calculated core word and the field relevance are stored in the core word dictionary 8.

次に、中間層のカテゴリのコアワード作成処理がなされる（ステップ７０２）。 Next, a core word creation process of the category of the intermediate layer is performed (step 702).

コアワード作成手段２により、最上層のカテゴリである「ＴＯＰ」が取り出され（ステップ７２１）、その「ＴＯＰ」を親カテゴリに持つカテゴリ「スポーツ」、「コンピュータ」が、カテゴリ関係辞書１３から検索される（ステップ７２２）。 The core word creating means 2 extracts the top-layer category “TOP” (step 721), and searches the category relation dictionary 13 for the categories “sports” and “computer”, which have “TOP” as their parent category (step 722).

コアワード作成手段２により、「スポーツ」、「コンピュータ」のすべてについて、コアワードが作成されていないと判断されると、その「スポーツ」、「コンピュータ」を親カテゴリに持つカテゴリ（子カテゴリ）が検索される（ステップ７２４）。 When the core word creation means 2 determines that no core word has been created for all of “sports” and “computer”, a category (child category) having “sports” and “computer” as parent categories is searched. (Step 724).

このとき、「スポーツ」については、「野球」、「サッカー」などが子カテゴリとして検索され、「コンピュータ」については、「ＯＳ」、「プログラミング」などが子カテゴリとして検索される。 At this time, for “sports”, “baseball”, “soccer” and the like are searched as child categories, and for “computer”, “OS”, “programming” and the like are searched as child categories.

ここで、「野球」、「サッカー」、「プログラミング」等は最下層のカテゴリであるから、Ｓ７０１においてコアワードが作成されている。一方、「ＯＳ」等は最下層でないから、コアワードが作成されていない。 Here, since “baseball”, “soccer”, “programming”, and the like are lowest-layer categories, their core words have already been created in S701. On the other hand, “OS” and the like are not in the lowest layer, so their core words have not yet been created.

ステップ７２５において、コアワード作成手段２は、ステップ７２４で検索された各カテゴリ（子カテゴリ）について、すべてコアワードが作成されているか否かを確認し、すべての子カテゴリがコアワードを作成するまで繰り返す。 In step 725, the core word creating means 2 checks whether core words have been created for every category (child category) found in step 724, and repeats the processing until core words exist for all child categories.

つまり、「スポーツ」については、子カテゴリである「野球」、「サッカー」のすべてにコアワードが作成されているから、コアワード作成手段２の処理はステップ７２６以降に進み、「スポーツ」についてのコアワードが作成され、そのコアワードの分野関連度が付与され、コアワード辞書８に格納される。 That is, for “sports”, since core words are created for all of the child categories “baseball” and “soccer”, the processing of the core word creation means 2 proceeds to step 726 and subsequent steps, and the core word for “sports” is It is created, the field relevance of the core word is given, and stored in the core word dictionary 8.

一方、「コンピュータ」については、子カテゴリである「ＯＳ」のコアワードが作成されていないから、ステップ７２４に戻り、「ＯＳ」を親カテゴリにもつカテゴリが、カテゴリ関係辞書１３から検索される（ステップ７２４）。つまり、「ＯＳ」の子カテゴリである「ＯＳ１」、「ＯＳ２」等が検索される。 On the other hand, for “computer”, the core words of its child category “OS” have not been created, so the processing returns to step 724, and the categories having “OS” as their parent category are searched for in the category relation dictionary 13 (step 724). That is, “OS1”, “OS2”, and the like, which are child categories of “OS”, are found.

この「ＯＳ１」、「ＯＳ２」等は最下層のカテゴリであるから、ステップ７０１において、コアワードが作成されている。 Since “OS1”, “OS2”, and the like are the lowest-level categories, a core word is created in step 701.

従って、「ＯＳ」についてのコアワード作成手段２の処理はステップ７２６以降に進み、「ＯＳ」についてのコアワードが作成され、そのコアワードの分野関連度が付与され、コアワード辞書８に格納される。 Accordingly, the processing of the core word creating means 2 for “OS” proceeds to step 726 and subsequent steps, a core word for “OS” is created, the field relevance of the core word is given, and stored in the core word dictionary 8.

続いて、Ｓ７２２に戻り、「ＴＯＰ」の子カテゴリである「コンピュータ」についてコアワードが作成されていないとコアワード作成手段２に判断される。そして、コアワード作成手段２により、「コンピュータ」を親カテゴリとする「ＯＳ」、「プログラミング」等が検索される（ステップ７２２〜７２４）。 Subsequently, the process returns to S722, and it is determined by the core word creating means 2 that a core word has not been created for “computer” which is a child category of “TOP”. Then, the core word creating means 2 searches for “OS”, “programming”, etc. with “computer” as a parent category (steps 722 to 724).

この場合、「ＯＳ」について、上記処理によりコアワードが作成されているので、「コンピュータ」の子カテゴリである「ＯＳ」、「プログラミング」等のすべてにコアワードが作成されているから、コアワード作成手段２の処理はステップ７２６以降に進み、「コンピュータ」についてのコアワードが作成され、そのコアワードの分野関連度が付与され、コアワード辞書８に格納される（ステップ７２５〜７２７）。 In this case, since the core word is created for the “OS” by the above processing, the core word is created in all of the child categories “OS”, “programming”, etc. of the “computer”. The process proceeds to step 726 and subsequent steps, a core word for “computer” is created, the field relevance of the core word is given, and stored in the core word dictionary 8 (steps 725 to 727).

（Ｂ−２−４）中間層のカテゴリの分野関連度付与処理 (B-2-4) Field Relevance Assignment Processing for Intermediate-Layer Categories
図２２は、中間層のカテゴリの分野関連度付与処理の動作を説明するフローチャートである。 FIG. 22 is a flowchart for explaining the operation of the field relevance assignment processing for the intermediate-layer categories.

図２２において、コアワード作成手段２により親になるカテゴリ名が取り出される（ステップ７３１）。その親になるカテゴリ名の直下にある子カテゴリのカテゴリ名が取り出される（ステップ７３２）。 In FIG. 22, the core word creating means 2 extracts the name of the category that is to be the parent (step 731). The names of the child categories directly under that parent category are then extracted (step 732).

そして、その直下の子カテゴリに対するコアワードがコアワード辞書８から取り出され（ステップ７３３）、すべてのコアワードに対して上限値の計算が済んでいるならば（ステップ７３４）、終了する。 Then, the core word for the immediate child category is extracted from the core word dictionary 8 (step 733), and if the upper limit value has been calculated for all the core words (step 734), the process is ended.

ステップ７３４において、すべてのコアワードに対して上限値の計算が済んでいない場合、そのコアワードの分野関連度がコアワード辞書８から取り出される（ステップ７３５）。 If the upper limit value has not been calculated for all core words in step 734, the field relevance of the core word is retrieved from the core word dictionary 8 (step 735).

そして、その分野関連度について、最大値が求められ（ステップ７３６）、平均値と標準偏差とが計算され（ステップ７３７）、範囲の上限値が計算され（ステップ７３８）、最大値が上限値より小さいか否かが判断される（ステップ７３９）。 Then, the maximum value is obtained for the field relevance (step 736), the average value and the standard deviation are calculated (step 737), the upper limit value of the range is calculated (step 738), and the maximum value is calculated from the upper limit value. It is determined whether or not it is smaller (step 739).

そして、最大値が上限値より小さい場合、範囲内であるので親カテゴリのコアワードにする（ステップ７４０）し、最大値が上限値以上である場合、ステップ７３４に戻り処理を繰り返す。 If the maximum value is smaller than the upper limit, it is within the range, so the word is made a core word of the parent category (step 740); if the maximum value is greater than or equal to the upper limit, the processing returns to step 734 and is repeated.

例えば、カテゴリ「スポーツ」のすべての子カテゴリにあるコアワード「選手」の分野関連度の平均値が０．００１３８８、標準偏差が０．０００５１６、最大値が０．００２５７の場合、範囲の上限値は０．００２９４になり、最大値が範囲内であるので、コアワード「選手」はカテゴリ「スポーツ」のコアワードになる。カテゴリ「スポーツ」のコアワード「選手」の分野関連度は０．００２９４になる。 For example, when the average value of the field relevance of the core word “player” in all the child categories of the category “sports” is 0.001388, the standard deviation is 0.000516, and the maximum value is 0.00257, the upper limit value of the range is Since the maximum value is within the range, the core word “player” becomes the core word of the category “sports”. The field relevance of the core word “player” in the category “sports” is 0.00294.

また例えば、カテゴリ「スポーツ」のすべての子カテゴリにあるコアワード「試合」の分野関連度の平均値が０．００１９８５、標準偏差が０．００１８２１、最大値が０．００４９７の場合、範囲の上限値は０．００７４５になり、最大値が範囲内であるので、コアワード「試合」はカテゴリ「スポーツ」のコアワードになる。カテゴリ「スポーツ」のコアワード「試合」の分野関連度は０．００７４５になる。 For example, when the average value of the field relevance of the core word “game” in all child categories of the category “sports” is 0.001985, the standard deviation is 0.001821, and the maximum value is 0.00497, the upper limit value of the range Becomes 0.00745, and the maximum value is within the range, so the core word “game” becomes the core word of the category “sports”. The field relevance of the core word “game” of the category “sports” is 0.00745.

また、カテゴリ「スポーツ」のすべての子カテゴリにあるコアワード「投手」の分野関連度の平均値が０．０００７８３、標準偏差が０．００２５６５、最大値が０．００９２９の場合、範囲の上限値は０．００８４８になり、最大値が範囲を超えているので、コアワード「投手」はカテゴリ「スポーツ」のコアワードにならない。図２３は上記の処理によって作成されたコアワード辞書８の例である。 In addition, when the average value of the field relevance of the core word “pitcher” in all child categories of the category “sports” is 0.000783, the standard deviation is 0.002565, and the maximum value is 0.00929, the upper limit value of the range is Since it becomes 0.00848 and the maximum value exceeds the range, the core word “pitcher” does not become the core word of the category “sports”. FIG. 23 shows an example of the core word dictionary 8 created by the above processing.
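The three worked examples above can be checked with a few lines of arithmetic. The snippet below only recomputes the upper limit “mean + 3 × standard deviation” from the mean, standard deviation, and maximum quoted in the text; the English glosses used as keys and the function name `upper_limit` are assumptions for illustration.

```python
def upper_limit(mean, sd, k=3):
    """Upper limit of the range: mean + k * standard deviation."""
    return mean + k * sd


# (mean, standard deviation, maximum) as quoted in the examples above.
EXAMPLES = {
    "player (選手)": (0.001388, 0.000516, 0.00257),
    "game (試合)": (0.001985, 0.001821, 0.00497),
    "pitcher (投手)": (0.000783, 0.002565, 0.00929),
}

for word, (mean, sd, maximum) in EXAMPLES.items():
    limit = upper_limit(mean, sd)
    print(f"{word}: upper limit {limit:.5f}, promoted: {maximum < limit}")
# prints:
# player (選手): upper limit 0.00294, promoted: True
# game (試合): upper limit 0.00745, promoted: True
# pitcher (投手): upper limit 0.00848, promoted: False
```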

（Ｂ−３）第２の実施形態の効果 (B-3) Effects of the Second Embodiment
以上、第２の実施形態によれば、第１の実施形態と同様の効果を奏する。 As described above, the second embodiment provides the same effects as the first embodiment.

また、第２の実施形態によれば、階層構造の各カテゴリに前もってコアワードを作成しておくことによって、階層の深さに関係なく一度に語を分類することができる。その際、全てのカテゴリに対して属する文書を用意する必要はないので、記憶資源の節約にも寄与する。 Further, according to the second embodiment, by generating core words in advance for each category of the hierarchical structure, words can be classified at a time regardless of the depth of the hierarchy. At this time, it is not necessary to prepare documents belonging to all categories, which contributes to saving of storage resources.

（Ｃ）他の実施形態 (C) Other Embodiments
（Ｃ−１）上述した第１及び第２の実施形態に係るシステムは、自然言語の翻訳のために、単語、複合語、語句、対訳語句などを辞書に登録する辞書登録システムに適用することが可能である。これにより、人間が判定する手間が省け自動的に、対訳語句を分類して辞書に登録することができる。 (C-1) The systems according to the first and second embodiments described above can be applied to a dictionary registration system that registers words, compound words, phrases, bilingual phrases, and the like in a dictionary for natural language translation. This makes it possible to classify bilingual phrases and register them in the dictionary automatically, without the effort of human judgment.

（Ｃ−２）上述した第１及び第２の実施形態では、判定対象語を判定したカテゴリに格納し結果をユーザに出力すると説明したが、ユーザに結果を出力しなくてもよいし、カテゴリ辞書に格納する前に出力してカテゴリをユーザに選択させるようにしてもよい。 (C-2) In the first and second embodiments described above, the determination target word is stored in the determined category and the result is output to the user. However, the result may not be output to the user, and the category The data may be output before being stored in the dictionary so that the user can select a category.

分野判定度のリストを作成するまでの手順を利用して、対訳ではない語を分野判定することも可能である。さらに、対訳ではない語の分野判定度のリストを格納しておいて、対訳の語を分野判定する際に利用してもよい。 It is also possible to determine the field of words that are not parallel translations using the procedure up to the creation of the list of field determination degrees. Further, a list of field judgment degrees of words that are not parallel translations may be stored and used when the field of bilingual words is subject to field judgment.

（Ｃ−３）また、第１の実施形態の冒頭で説明した各種用語の定義は、システム運用などに応じて変えることができる。 (C-3) Also, the definitions of various terms explained at the beginning of the first embodiment can be changed according to the system operation and the like.

例えば、コアワードや不要語の作成は品詞の種類を変更したりｎグラムで切り出したりなどの別の方法で定義してもよいし、追加や削除ができるようにしてもよい。さらに共起関係は、修飾関係などの別の方法で定義してもよいし、抽出する範囲を広くしたり狭くしたりしてもよい。また、分野関連度やその重み付けは、見出しに含まれる語は高くしたり、語間の距離を反映したり、などの別の方法で計算してもよいし、語を指定して調整できるようにしてもよい。中間層の分野関連度の計算は、範囲や範囲の上限値の定義を変更してもよい。また、各文書データベースは、ネットワークを介して参照するようにしてもよい。 For example, core words and unnecessary words may be defined by other methods, such as changing the parts of speech used or cutting out n-grams, and it may be made possible to add or delete them. The co-occurrence relationship may likewise be defined by another method, such as a modification relationship, and the range from which it is extracted may be widened or narrowed. The field relevance and its weighting may be calculated by other methods, such as giving a higher value to words that appear in headings or reflecting the distance between words, and it may be made possible to adjust them for specified words. In calculating the field relevance of the intermediate layers, the definition of the range or of its upper limit may be changed. Each document database may also be referenced via a network.

(C-4) The components (1 to 13) described in the first and second embodiments need not be implemented in a single device and may be distributed across a plurality of devices. For example, the original-language classified documents 7, the original-language unclassified documents 9, the translated-language unclassified documents 10, and the like may be replaced by websites accessible via a network.

(C-5) The classification target is not limited to single words and may be a compound word or a phrase. Furthermore, the configurations of the various dictionaries and lists described in the first and second embodiments are not limited to those illustrated.

(C-6) Most of the functions described above as being realized in hardware can be realized in software, and almost all of the functions realized in software can be realized in hardware.

Functional block diagram of the bilingual phrase classification system of the first embodiment.
Flowchart showing the overall operation in the first embodiment.
Flowchart showing the document storage process in the first embodiment.
Flowchart showing the core word creation process in the first embodiment.
Flowchart showing the bilingual field determination process in the first embodiment.
Flowchart showing the process of calculating the field determination degree of the determination target original word in the first embodiment.
Flowchart showing the process of calculating the field determination degree of the determination target translated word in the first embodiment.
Flowchart showing the process of calculating the integrated field determination degree of the determination target word in the first embodiment.
Diagram showing a configuration example of the word dictionary in the first embodiment.
Diagram showing a configuration example of the field determination degree list of the determination target original word in the first embodiment.
Diagram showing the relationship among the determination target translated word, its co-occurring words, their original-language translations, and the core words in the first embodiment.
Diagram showing a configuration example of the field determination degree list of the determination target translated word in the first embodiment.
Diagram showing the configuration of the field determination degree list of the determination target word in the first embodiment.
Diagram showing a configuration example of the category dictionary of the first embodiment.
Functional block diagram of the bilingual phrase classification system of the second embodiment.
Explanatory diagram of the hierarchical categories in the second embodiment.
Diagram showing a configuration example of the category relation dictionary of the second embodiment.
Flowchart showing the document storage process in the second embodiment.
Flowchart showing the core word creation process in the second embodiment.
Flowchart showing the core word creation process for the lowest layer in the second embodiment.
Flowchart showing the core word creation process for the intermediate layers in the second embodiment.
Flowchart showing the process of assigning field relevance to intermediate-layer categories in the second embodiment.
Diagram showing a configuration example of the core word dictionary in the second embodiment.

Explanation of symbols

1 ... input means, 2 ... core word creating means, 3 ... original word determining means, 4 ... translated word determining means, 5 ... bilingual determination means, 6 ... output means, 7 ... original-language classified documents, 8 ... core word dictionary, 9 ... original-language unclassified documents, 10 ... translated-language unclassified documents, 11 ... word dictionary, 12 ... category dictionary, 13 ... category relation dictionary.

Claims

A bilingual phrase classification system that determines the category of a determination target word, input as an original word and its translated word, and classifies the determination target word by category, the system comprising:
category representative word creating means for creating, from documents classified by category, one or more category representative words that characteristically express each category, together with a field relevance indicating how characteristic each category representative word is;
category representative word holding means for holding each category representative word created by the category representative word creating means and its field relevance;
original word determining means for extracting, from original-language unclassified documents, one or more words that co-occur with the determination target original word, retrieving from the category representative word holding means the field relevance of those extracted co-occurring words that are category representative words, and creating a field determination degree list of the determination target original word;
translated word determining means for extracting, from translated-language unclassified documents, one or more words that co-occur with the determination target translated word, determining one or more words that are original-language translations of the extracted co-occurring words, retrieving from the category representative word holding means the field relevance of those original-language translations that are category representative words, and creating a field determination degree list of the determination target translated word;
bilingual determination means for integrating the field determination degrees of the category representative words in the field determination degree list of the determination target original word and in the field determination degree list of the determination target translated word, and creating a category-specific integrated field determination degree of the determination target word; and
output means for outputting category classification information of the determination target word based on the category-specific integrated field determination degree created by the bilingual determination means.
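The flow recited in this claim can be illustrated with a minimal Python sketch. It is not the implementation of the embodiments: the function names, the fixed token window used as the co-occurrence relation, the per-category summation, and the integration by simple addition are all assumptions made for illustration.

```python
from collections import defaultdict


def cooccurring_words(target, documents, window=2):
    """Return words within `window` tokens of each occurrence of `target`.

    A fixed token window is an assumed stand-in for the co-occurrence relation.
    """
    found = []
    for doc in documents:
        tokens = doc.split()
        for i, token in enumerate(tokens):
            if token != target:
                continue
            start = max(0, i - window)
            found.extend(tokens[start:i] + tokens[i + 1:i + 1 + window])
    return found


def field_determination(words, core_words):
    """Sum, per category, the field relevance of the words that are core words."""
    scores = defaultdict(float)
    for word in words:
        for category, relevance in core_words.get(word, {}).items():
            scores[category] += relevance
    return dict(scores)


def integrate(original_scores, translated_scores):
    """Integrate both lists by addition (an assumed integration rule)."""
    categories = set(original_scores) | set(translated_scores)
    return {c: original_scores.get(c, 0.0) + translated_scores.get(c, 0.0)
            for c in categories}


def classify(original_word, translated_word, original_docs, translated_docs,
             to_original, core_words):
    """Return the category with the highest integrated degree, or None."""
    original_scores = field_determination(
        cooccurring_words(original_word, original_docs), core_words)
    # Map co-occurring translated-language words back to the original language.
    back_translated = [o for w in cooccurring_words(translated_word, translated_docs)
                       for o in to_original.get(w, [])]
    translated_scores = field_determination(back_translated, core_words)
    merged = integrate(original_scores, translated_scores)
    return max(merged, key=merged.get) if merged else None


# Hypothetical toy data: core word dictionary and word dictionary.
core_words = {"bank": {"finance": 0.8, "geography": 0.2},
              "loan": {"finance": 0.9}, "river": {"geography": 0.7}}
to_original = {"ginkou": ["bank"], "yuushi": ["loan"], "kawa": ["river"]}
print(classify("interest", "kinri", ["bank loan interest rate"],
               ["ginkou yuushi kinri"], to_original, core_words))
```

With the toy data above, both the original-language context and the back-translated context contribute to the same category, so that category receives the highest integrated degree.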

The bilingual phrase classification system according to claim 1, wherein the translated word determining means creates the field determination degree list of the determination target translated word by weighting the field relevance of each word that is an original-language translation of an extracted co-occurring word by the number of occurrences, in the translated-language unclassified documents, of the word corresponding to that original-language translation.
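The occurrence-count weighting of this claim might be sketched as follows in Python. The function name, the data layout, and the choice to multiply the field relevance by the raw count of each co-occurring translated-language word are assumptions for illustration; the claim does not fix the weighting formula.

```python
from collections import Counter, defaultdict


def weighted_translated_scores(cooccurring, to_original, core_words):
    """Weight each back-translated core word's relevance by occurrence count.

    cooccurring: co-occurring translated-language words, one entry per occurrence
    to_original: translated-language word -> list of original-language translations
    core_words:  original-language core word -> {category: field relevance}
    """
    counts = Counter(cooccurring)  # number of occurrences of each co-occurring word
    scores = defaultdict(float)
    for word, n in counts.items():
        for original in to_original.get(word, []):
            for category, relevance in core_words.get(original, {}).items():
                scores[category] += n * relevance  # assumed: linear weight by count
    return dict(scores)
```

Under this assumption, a co-occurring word that appears twice contributes twice the field relevance of its original-language translation.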

The bilingual phrase classification system according to claim 1 or 2, wherein the bilingual determination means integrates the field determination degrees so that category representative words ranked high in the field determination degrees of both the determination target original word and the determination target translated word are evaluated highly.

The bilingual phrase classification system according to claim 1, wherein the category representative word creating means comprises:
a category relation dictionary unit that holds, for hierarchical categories having a predetermined hierarchical structure, hierarchical relationship information between the categories;
a lowest-layer category representative word determination unit that determines one or more category representative words that characteristically express the contents of documents classified into any of the lowest-layer categories of the hierarchical categories; and
an upper category representative word determination unit that, referring to the hierarchical relationship information between the categories in the category relation dictionary unit, determines the category representative words of a parent category by using the category representative words of its child categories, starting from the child categories that include the lowest-layer categories, and repeats this determination successively toward the upper layers of the hierarchical categories.
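The bottom-up determination of upper-layer category representative words can be pictured with a short Python sketch. The dictionary layout and the rule of taking the maximum child relevance for each word are assumptions made for illustration; the embodiments define their own calculation for intermediate-layer field relevance.

```python
def build_core_words(category, children, leaf_core_words):
    """Return {word: field relevance} for `category`, derived bottom-up.

    children:        category -> list of child categories (category relation dictionary)
    leaf_core_words: lowest-layer category -> {word: field relevance}
    """
    kids = children.get(category, [])
    if not kids:
        # Lowest-layer category: use the core words determined from its documents.
        return dict(leaf_core_words.get(category, {}))
    merged = {}
    for kid in kids:
        # Recurse first, so every child is resolved before its parent.
        for word, relevance in build_core_words(kid, children, leaf_core_words).items():
            # Assumed rule: a parent keeps the highest relevance among its children.
            merged[word] = max(merged.get(word, 0.0), relevance)
    return merged
```

For example, with hypothetical categories `physics` and `chemistry` under `science`, calling `build_core_words("science", ...)` yields the union of the children's core words, each with the larger of the children's relevance values.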

A bilingual phrase classification method for determining the category of a determination target word, input as an original word and its translated word, and classifying the determination target word by category, wherein:
category representative word creating means creates, from documents classified by category, one or more category representative words that characteristically express each category, together with a field relevance indicating how characteristic each category representative word is;
category representative word holding means holds each category representative word created by the category representative word creating means and its field relevance;
original word determining means extracts, from original-language unclassified documents, one or more words that co-occur with the determination target original word, retrieves from the category representative word holding means the field relevance of those extracted co-occurring words that are category representative words, and creates a field determination degree list of the determination target original word;
translated word determining means extracts, from translated-language unclassified documents, one or more words that co-occur with the determination target translated word, determines one or more words that are original-language translations of the extracted co-occurring words, retrieves from the category representative word holding means the field relevance of those original-language translations that are category representative words, and creates a field determination degree list of the determination target translated word;
bilingual determination means integrates the field determination degrees of the category representative words in the field determination degree list of the determination target original word and in the field determination degree list of the determination target translated word, and creates a category-specific integrated field determination degree of the determination target word; and
output means outputs category classification information of the determination target word based on the category-specific integrated field determination degree created by the bilingual determination means.

A bilingual phrase classification program that realizes, in a bilingual phrase classification device that determines the category of a determination target word, input as an original word and its translated word, and classifies the determination target word by category:
category representative word creating means for creating, from documents classified by category, one or more category representative words that characteristically express each category, together with a field relevance indicating how characteristic each category representative word is;
category representative word holding means for holding each category representative word created by the category representative word creating means and its field relevance;
original word determining means for extracting, from original-language unclassified documents, one or more words that co-occur with the determination target original word, retrieving from the category representative word holding means the field relevance of those extracted co-occurring words that are category representative words, and creating a field determination degree list of the determination target original word;
translated word determining means for extracting, from translated-language unclassified documents, one or more words that co-occur with the determination target translated word, determining one or more words that are original-language translations of the extracted co-occurring words, retrieving from the category representative word holding means the field relevance of those original-language translations that are category representative words, and creating a field determination degree list of the determination target translated word;
bilingual determination means for integrating the field determination degrees of the category representative words in the field determination degree list of the determination target original word and in the field determination degree list of the determination target translated word, and creating a category-specific integrated field determination degree of the determination target word; and
output means for outputting category classification information of the determination target word based on the category-specific integrated field determination degree created by the bilingual determination means.