JP2008299777A

JP2008299777A - Multilingual word classification device and multilingual word classification program

Info

Publication number: JP2008299777A
Application number: JP2007147849A
Authority: JP
Inventors: Naoto Kato; 直人加藤; Akinori Kinoshita; 明徳木下
Original assignee: Nippon Hoso Kyokai NHK; Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 2007-06-04
Filing date: 2007-06-04
Publication date: 2008-12-11

Abstract

PROBLEM TO BE SOLVED: To provide a multilingual word classification device and a multilingual word classification program capable of classifying words included in a language into content words and function words without using statistical characteristics of the words even when an electronic dictionary comprising classified word classes cannot be prepared.

SOLUTION: When a plurality of languages, for which a word list including a plurality of words is prepared, and a target language, for which content words and function words are not classified, are constructed of the same character string or include the predetermined number or more of same characters, the multilingual word classification device 1 uses the word list to classify words of the target language into content words and function words. The multilingual word classification device 1 is provided with a monolingual interword similarity calculation means 3, a word list group storage means 5, a multilingual interword similarity calculation means 7, and a word determination means 9.

COPYRIGHT: (C)2009,JPO&INPIT

Description

The present invention relates to a multilingual word classification device and a multilingual word classification program that, for a language whose words have not been classified into content words and function words, classify those words into content words and function words.

Conventionally, Non-Patent Document 1 discloses, as one technique for classifying the words of a language as content words or function words, a technique based on differences in the words' parts of speech. A content word is a word that carries meaning, for example a noun or a verb; a function word is any word other than a content word, for example a particle or an auxiliary verb. Content words and function words are mutually exclusive.

The conventional technique of Non-Patent Document 1 focuses on this point and classifies the words of a language into content words and function words based on their parts of speech.
In general, if an electronic dictionary can be prepared for a language, the dictionary classifies each word's part of speech, and the words of the language can easily be classified into content words and function words on that basis; conventional techniques have therefore often relied on electronic dictionaries. Where an electronic dictionary is readily available, as for English, classifying the words into content words and function words is easy.

Among the conventional techniques there are also ones that classify words as content words or function words for a language for which no electronic dictionary can be prepared. One such technique, which uses the statistical features of words, is disclosed in Non-Patent Document 2.

The technique of Non-Patent Document 2 uses co-occurrence as the statistical feature, that is, the high probability that when one word appears another particular word also appears. From these co-occurrence relations it classifies the words of a language as content words or function words.

As a further statistical feature, function words generally appear more frequently than content words, and a wide variety of content words appear before and after a function word; these features can also be used to classify the words of a language as content words or function words.
Non-Patent Document 1: Takenobu Tokunaga, "Information Retrieval and Language Processing", University of Tokyo Press, pp. 19-23.
Non-Patent Document 2: P. F. Brown et al., "Class-based n-gram models of natural language", Computational Linguistics, pp. 467-479, 1992.

However, with the conventional techniques, classifying the words of several languages as content words or function words requires obtaining an electronic dictionary for each language. Except for languages such as English, electronic dictionaries can be difficult to obtain, and without one the words cannot be classified into content words and function words.

Moreover, the statistical features of words are not effective for a language containing words whose statistical features as content words differ little from those of function words, so in that case too the words of the language cannot be classified into content words and function words.
Spanish is an example of such a language. The Spanish word "ministro" (minister), for example, is a content word, yet it has the same statistical features as a function word (for example, it appears frequently and a wide variety of words appear before and after it), so statistical features cannot classify it correctly as a content word.

Furthermore, when the difference in frequency between content words and function words is used as the statistical feature, function words generally appear more frequently than content words, but some languages have content words whose frequency is comparable to that of function words, and those content words cannot be classified correctly.

Accordingly, an object of the present invention is to solve the above problems and provide a multilingual word classification device and a multilingual word classification program that can classify the words of a language into content words and function words without using the statistical features of words, even when no electronic dictionary with classified parts of speech can be prepared.

To solve the above problems, the multilingual word classification device according to claim 1 classifies words of a target language into content words and function words using word lists, where a plurality of languages, each provided with a word list including a plurality of words, and the target language, whose content words and function words are not classified, are composed of the same character string or share at least a preset number of the same characters. The device comprises word list group storage means, monolingual word similarity calculation means, multilingual word similarity calculation means, and word determination means.

With this configuration, the monolingual word similarity calculation means of the multilingual word classification device compares the characters of the words in each word list of the word list group stored in the word list group storage means with the characters of the input target-language word. Based on the number of matching characters and the order in which they appear, it calculates, for the words in each word list, a monolingual word similarity indicating how similar a listed word is to the target-language word, and outputs, for each language, the maximum monolingual word similarity. The multilingual word similarity calculation means then excludes every maximum monolingual word similarity below a preset first threshold and averages the rest. Finally, the word determination means determines that the target-language word is a content word when that average is at or above a preset second threshold, and a function word when it is below the second threshold.
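The three steps just described can be written compactly. The notation below is introduced here only for illustration and is not the patent's own: $L_1, \dots, L_{N-1}$ are the word lists, $\theta_1$ and $\theta_2$ are the first and second thresholds, and $m_k(w)$ is the per-language maximum. SimMono and SimMulti are the names used in the worked example later in the description. The source does not say what happens when every language is excluded, so the average is left undefined in that case.

```latex
\begin{align*}
m_k(w) &= \max_{v \in L_k} \mathrm{SimMono}(w, v), \qquad k = 1, \dots, N-1 \\
K(w) &= \{\, k : m_k(w) \ge \theta_1 \,\} \\
\mathrm{SimMulti}(w) &= \frac{1}{|K(w)|} \sum_{k \in K(w)} m_k(w) \\
w \text{ is a content word} &\iff \mathrm{SimMulti}(w) \ge \theta_2
\end{align*}
```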

The multilingual word classification program according to claim 2 classifies words of a target language into content words and function words using word lists, where a plurality of languages, each provided with a word list including a plurality of words, and the target language, whose content words and function words are not classified, are composed of the same character string or share at least a preset number of the same characters. To that end, it causes a computer provided with word list group storage means, which stores a word list group consisting of a plurality of the word lists, to function as monolingual word similarity calculation means, multilingual word similarity calculation means, and word determination means.

With this configuration, the multilingual word classification program uses the monolingual word similarity calculation means to compare the characters of the words in each word list of the word list group stored in the word list group storage means with the characters of the input target-language word. Based on the number of matching characters and the order in which they appear, it calculates, for the words in each word list, a monolingual word similarity indicating how similar a listed word is to the target-language word, and outputs, for each language, the maximum monolingual word similarity. Using the multilingual word similarity calculation means, it excludes every maximum monolingual word similarity below a preset first threshold and averages the rest. Using the word determination means, it then determines that the target-language word is a content word when that average is at or above a preset second threshold, and a function word when it is below the second threshold.

According to the inventions of claims 1 and 2, the monolingual word similarity between the characters of the words in each language's word list and the characters of the target-language word is calculated for a plurality of languages, the per-language maxima are averaged, and the word is determined to be a content word when the average is at or above a threshold. The words of the target language can thus be classified into content words and function words without using the statistical features of words, even when electronic dictionaries with classified parts of speech cannot be prepared for the plurality of languages.

Next, embodiments of the present invention will be described in detail with reference to the drawings as appropriate.
(Configuration of the multilingual word classification device)
FIG. 1 is a block diagram of the multilingual word classification device. As shown in FIG. 1, the multilingual word classification device 1 uses word lists, each including a plurality of words, for a plurality of languages to determine whether an input target-language word is a content word or a function word, and outputs the determination result. It comprises monolingual word similarity calculation means 3, word list group storage means 5, multilingual word similarity calculation means 7, and word determination means 9. By outputting the determination result, the multilingual word classification device 1 in effect classifies the words of the target language into content words and function words.

The plurality of languages is any number of languages other than the target language. Since there is one target language, that number is written N-1 (N being a natural number). The N-1 languages are referred to, in the order in which the device 1 processes them, as the first language, the second language, ..., and the (N-1)th language.

A word list is simply a listing of a plurality of words; its words are classified neither by part of speech nor into content words and function words.

The plurality of languages and the target language are either composed of the same characters or include the same characters. Examples of languages composed of an alphabet (Latin characters, Hebrew characters, Arabic characters, etc.) are English, German, French, Spanish, Portuguese, and Russian; examples of languages that include Chinese characters (kanji) are Chinese and Japanese.

A content word is a word that has meaning on its own; in Japanese, for example, nouns and verbs. A function word is a word that has no meaning on its own; in Japanese, for example, particles and auxiliary verbs.

This embodiment uses four languages (English, Portuguese, French, and Russian) as the plurality of languages and Spanish as the target language. The invention is not limited to these: the plurality of languages may be any languages that are composed of the same character string or include at least a preset number of the same characters and for which word lists can be obtained, and the target language may be any language that is composed of the same characters as those languages or includes the same characters.

The multilingual word classification device 1 is built on the premise that content words are shared across languages as similar words, whereas function words are not shared and exist as words peculiar to each language. For example, content words such as nouns (proper nouns in particular) exist across languages as similar words that are read the same and differ only in spelling, while function words such as particles, auxiliary verbs, prepositions, and articles exist as distinctive words that follow each language's grammar.

The monolingual word similarity calculation means 3 compares the characters of the words in each word list of the word list group stored in the word list group storage means 5 with the characters of the input target-language word. Based on the number of matching characters and the order in which they appear, it calculates a monolingual word similarity indicating how similar a listed word is to the target-language word, and outputs, for each language, the maximum monolingual word similarity.

That is, the monolingual word similarity calculation means 3 calculates the word similarities between one language and the target language and outputs the maximum, then does the same for another language, and repeats this, outputting the maximum word similarity for each language, until no word list remains in the word list group stored in the word list group storage means 5.

The monolingual word similarity between a word in a word list and a target-language word takes a higher value the more characters the two words have in common, and a higher value the more closely the order of the matching characters agrees. It is also higher the closer the two words are in number of characters.

In this embodiment, the monolingual word similarity is calculated by the method disclosed in Kato et al., "Automatic alignment of broadcast news articles between two similar languages", Proceedings of the 9th Annual Meeting of the Association for Natural Language Processing, pp. 322-325, 2003.
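The cited method is not reproduced in this description. As a rough illustration only, the sketch below swaps in a different, well-known measure: the length of the longest common subsequence (LCS) divided by the length of the longer word. It shares the stated properties (more shared characters, better agreement in order, and closer lengths all raise the score), but it is an assumption and does not reproduce the figures quoted in the worked example (it gives 0.875 for "minister", where the patent reports 0.82). The names `lcs_length` and `sim_mono` are hypothetical.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (row-by-row DP)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def sim_mono(target: str, candidate: str) -> float:
    """Stand-in similarity in [0, 1]; NOT the measure of Kato et al."""
    if not target or not candidate:
        return 0.0
    # Dividing by the longer length penalizes words of unequal length.
    return lcs_length(target, candidate) / max(len(target), len(candidate))


print(sim_mono("ministro", "ministro"))  # 1.0
print(sim_mono("ministro", "minister"))  # 0.875
print(sim_mono("ministro", "visit"))     # 0.375
```

Any measure with the same three properties could be substituted; the later steps only need a score between 0 and 1.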

The word list group storage means 5 stores a word list group consisting of a plurality of word lists and is implemented with an ordinary storage medium such as a hard disk. In this embodiment the word list group contains an English word list, a Portuguese word list, a French word list, and a Russian word list. Each word list holds a plurality of words of its language, here taken from news articles and similar sources in that language.

The multilingual word similarity calculation means 7 takes the maximum monolingual word similarities output by the monolingual word similarity calculation means 3, excludes those below a preset first threshold, and averages the rest. In this embodiment the first threshold is set to 0.5. Adjusting this threshold to suit the languages can improve the accuracy of the determination. The threshold value was obtained experimentally, by running an evaluation experiment on data from one of the plurality of languages for which it is known whether each word is a content word or a function word.
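The exclude-then-average step can be sketched as follows, using the four per-language maxima from the worked example later in the description. The function name `sim_multi` is hypothetical, and returning 0.0 when every language is excluded is an assumption, since the description does not cover that case.

```python
from typing import Iterable


def sim_multi(max_sims: Iterable[float], first_threshold: float = 0.5) -> float:
    """Average the per-language maxima that reach the first threshold."""
    kept = [s for s in max_sims if s >= first_threshold]
    if not kept:
        return 0.0  # assumption: the source does not define this case
    return sum(kept) / len(kept)


# English 0.82, Portuguese 1.0, French 0.88, Russian 0.15 (excluded, below 0.5)
average = sim_multi([0.82, 1.0, 0.88, 0.15])
print(round(average, 2))  # 0.9
```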

The word determination means 9 determines that the target-language word is a content word when the average calculated by the multilingual word similarity calculation means 7 is at or above a preset second threshold, and a function word when the average is below the second threshold. In this embodiment the second threshold is set to 0.6.
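The determination rule is a single comparison. In the minimal sketch below, the function name `classify` and the string labels are illustrative choices; only the rule and the default of 0.6 come from the text.

```python
def classify(average: float, second_threshold: float = 0.6) -> str:
    """Return "content" at or above the second threshold, else "function"."""
    return "content" if average >= second_threshold else "function"


print(classify(0.9))  # content
print(classify(0.3))  # function
```

Note that a value exactly equal to the threshold counts as a content word, matching the "at or above" wording.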

The following describes an example of calculating monolingual word similarities with the monolingual word similarity calculation means 3, an example of averaging the maximum monolingual word similarities with the multilingual word similarity calculation means 7, and an example of a determination result from the word determination means 9. As stated above, in this embodiment the plurality of languages are English (first language), Portuguese (second language), French (third language), and Russian (fourth language), and the target language is Spanish.

Suppose that the English word list contains "president", "minister", and "visit"; the Portuguese word list contains "ministro"; the French word list contains "ministre"; and the Russian word list contains "Токио". In practice, each word list also contains many words besides those listed here.

When "ministro" is input to the multilingual word classification device 1 as a Spanish word, the device 1 first uses the monolingual word similarity calculation means 3 to calculate the monolingual word similarity for the words in the English word list as follows.
<Monolingual word similarity for English (first language)>
SimMono("ministro", "president") = 0.24 ... (1)
SimMono("ministro", "minister") = 0.82 ... (2)
SimMono("ministro", "visit") = 0.46 ... (3)

In (1), the English "president" is a nine-character string with one p, one r, two e, one s, one i, one d, one n, and one t, and the Spanish "ministro" is an eight-character string with one m, two i, one n, one s, one t, one r, and one o. Comparing these strings, the lengths differ, and although the characters s and t occur in both, they appear in a different order, so the monolingual word similarity is relatively low at 0.24.

In (2), the English "minister" is an eight-character string with one m, two i, one n, one s, one t, one e, and one r, and the Spanish "ministro" is an eight-character string with one m, two i, one n, one s, one t, one r, and one o. Comparing these strings, the lengths match, the characters m, i, n, s, and t occur in both, and all of these characters appear in the same order, so the monolingual word similarity is relatively high at 0.82.

In (3), the English "visit" is a five-character string with one v, two i, one s, and one t, and the Spanish "ministro" is an eight-character string with one m, two i, one n, one s, one t, one r, and one o. Comparing these strings, the lengths differ; the characters i and t occur in both, and i also appears in the same order, so the monolingual word similarity is 0.46, a value between those of (1) and (2).
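The per-character tallies quoted in (1) to (3) can be checked mechanically. The short sketch below uses Python's standard `collections.Counter`; it only counts characters and ignores their order, so it illustrates the tallies and not the similarity measure itself.

```python
from collections import Counter

# Length and per-character counts for each word discussed in (1) to (3).
for word in ("president", "minister", "visit", "ministro"):
    print(word, len(word), dict(Counter(word)))

# Multiset intersection: characters shared by the two words, with multiplicity.
shared = Counter("ministro") & Counter("minister")
print(sorted(shared.elements()))  # ['i', 'i', 'm', 'n', 'r', 's', 't']
```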

From these results, the maximum monolingual word similarity for English is 0.82 (0.82 in (2) > 0.46 in (3) > 0.24 in (1)), and the multilingual word classification device 1 uses the monolingual word similarity calculation means 3 to output this 0.82 to the multilingual word similarity calculation means 7.

The multilingual word classification device 1 likewise uses the monolingual word similarity calculation means 3 to calculate the monolingual word similarity for the word "ministro" in the Portuguese word list, the word "ministre" in the French word list, and the word "Токио" in the Russian word list, as follows.

<Monolingual word similarity for Portuguese (second language)>
SimMono("ministro", "ministro") = 1 ... (4)
<Monolingual word similarity for French (third language)>
SimMono("ministro", "ministre") = 0.88 ... (5)
<Monolingual word similarity for Russian (fourth language)>
SimMono("ministro", "Токио") = 0.15 ... (6)

In (4), the Portuguese "ministro" is an eight-character string with one m, two i, one n, one s, one t, one r, and one o, and the Spanish "ministro" is the same eight-character string. The two strings match completely, so the monolingual word similarity takes its highest value, 1.0.

In (5), the French "ministre" is an eight-character string with one m, two i, one n, one s, one t, one r, and one e, and the Spanish "ministro" is an eight-character string with one m, two i, one n, one s, one t, one r, and one o. Comparing these strings, the lengths match, the characters m, i, n, s, t, and r occur in both, and all of these characters appear in the same order, so the monolingual word similarity is relatively high at 0.88.

In (6), the Russian "Токио" is a five-character string with one Т, two о, one к, and one и, and the Spanish "ministro" is an eight-character string with one m, two i, one n, one s, one t, one r, and one o. Comparing these strings, the lengths differ, and neither the correspondence between the characters nor their order of appearance can be determined, so the monolingual word similarity is low at 0.15. These monolingual word similarities are output to the multilingual word similarity calculation means 7 as the maximum monolingual word similarities (other words in these languages are omitted here).
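The per-language maximum can be sketched as a loop over the word lists. To stay with the figures given in (1) to (6), the sketch below does not compute similarities: `quoted_sim` is a hypothetical stand-in that simply looks up the quoted score, and `max_sim_per_language` is an illustrative name. Each word list is assumed to be non-empty.

```python
from typing import Callable, Dict, List

# Scores (1) to (6) as quoted in the text for the target word "ministro".
QUOTED_SCORES: Dict[str, float] = {
    "president": 0.24, "minister": 0.82, "visit": 0.46,  # English
    "ministro": 1.0,                                     # Portuguese
    "ministre": 0.88,                                    # French
    "Токио": 0.15,                                       # Russian
}

WORD_LISTS: Dict[str, List[str]] = {
    "English": ["president", "minister", "visit"],
    "Portuguese": ["ministro"],
    "French": ["ministre"],
    "Russian": ["Токио"],
}


def quoted_sim(target: str, candidate: str) -> float:
    """Hypothetical stand-in: look the score up instead of computing it."""
    return QUOTED_SCORES[candidate]


def max_sim_per_language(
    target: str,
    word_lists: Dict[str, List[str]],
    sim: Callable[[str, str], float],
) -> Dict[str, float]:
    """For each language, keep only the best score over its word list."""
    return {
        lang: max(sim(target, word) for word in words)
        for lang, words in word_lists.items()
    }


maxima = max_sim_per_language("ministro", WORD_LISTS, quoted_sim)
print(maxima)
```

Passing the similarity function as a parameter keeps the loop independent of how the score is actually computed.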

Then, the multilingual word classification device 1 uses the multilingual word similarity calculation means 7 to take the average of the maximum monolingual word similarities as follows.
First, the multilingual word classification device 1 uses the multilingual word similarity calculation means 7 to exclude, from the monolingual word similarities output by the monolingual word similarity calculation means 3, any monolingual word similarity that is less than the first threshold. Here, the first threshold is set to 0.5, so the Russian maximum monolingual word similarity of 0.15 in (6) is excluded.
<Monolingual word similarity for English (first language)>
SimMono(“ministro”, “minister”) = 0.82
... (2)
<Monolingual word similarity for Portuguese (second language)>
SimMono(“ministro”, “ministro”) = 1
... (4)
<Monolingual word similarity for French (third language)>
SimMono(“ministro”, “ministre”) = 0.88
... (5)
<Monolingual word similarity for Russian (fourth language)>
SimMono(“ministro”, “Токио”) = 0.15
... (6)

Then, the multilingual word classification device 1 uses the multilingual word similarity calculation means 7 to take the average of the English maximum monolingual word similarity 0.82, the Portuguese maximum monolingual word similarity 1.0, and the French maximum monolingual word similarity 0.88.
Average SimMulti(“ministro”) = (0.82 + 1.0 + 0.88) / 3 = 0.9 ... (7)
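The exclusion and averaging step can be sketched as follows. This is an illustration under stated assumptions, not the device's actual implementation: the function name `multilingual_similarity` and the dictionary layout are hypothetical, and returning 0.0 when every language is excluded is an assumption, since the text does not say what happens in that case.

```python
from typing import Dict


def multilingual_similarity(max_sims: Dict[str, float],
                            first_threshold: float = 0.5) -> float:
    """Average the per-language maxima, dropping those below the first threshold."""
    kept = [s for s in max_sims.values() if s >= first_threshold]
    if not kept:
        # Assumption: the source does not define this case; 0.0 is used here.
        return 0.0
    return sum(kept) / len(kept)


# Maximum monolingual word similarities from (2), (4), (5), and (6).
max_sims = {
    "English": 0.82,
    "Portuguese": 1.0,
    "French": 0.88,
    "Russian": 0.15,  # below 0.5, so excluded from the average
}

sim_multi = multilingual_similarity(max_sims)
```

With the first threshold at 0.5, the Russian value is dropped and the remaining three values average to approximately 0.9, as in (7).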

As shown in (7), the average is 0.9, and the multilingual word classification device 1 uses the multilingual word similarity calculation means 7 to output this value of 0.9 to the word determination means 9. The multilingual word classification device 1 then uses the word determination means 9 to compare the average of 0.9 with the preset second threshold of 0.6 and, since the average is equal to or greater than the second threshold, determines the Spanish word “ministro” to be a content word.

According to the multilingual word classification device 1, the monolingual word similarity calculation means 3 calculates, for each of a plurality of languages, the monolingual word similarity between the characters constituting the words included in the word lists of the plurality of languages stored in the word list group storage means 5 and the characters constituting a word of the target language; the multilingual word similarity calculation means 7 takes the average of the resulting maximum monolingual word similarities; and the word is determined to be a content word when this average is equal to or greater than the second threshold. As a result, words included in the target language can be classified into content words and function words without using the statistical characteristics of the words, even when an electronic dictionary in which the parts of speech of words are classified cannot be prepared for the plurality of languages.

(Operation of the multilingual word classification device)
Next, the operation of the multilingual word classification device 1 will be described with reference to the flowchart shown in FIG. 2 (see FIG. 1 as appropriate).

First, the multilingual word classification device 1 uses the monolingual word similarity calculation means 3 to calculate, for the i-th language (first to fourth) stored in the word list group storage means 5, the monolingual word similarity between the characters constituting the input word of the target language and each word of that language, and outputs the maximum monolingual word similarity to the multilingual word similarity calculation means 7 (step S1).

Next, the multilingual word classification device 1 uses the monolingual word similarity calculation means 3 to determine whether the i-th language has reached N-1 (step S2) and repeats the calculation until it has (No in step S2). Once it has been reached (Yes in step S2), the device uses the multilingual word similarity calculation means 7 to exclude any maximum monolingual word similarity below the first threshold and to calculate the average value (the multilingual word similarity) (step S3).

Then, the multilingual word classification device 1 uses the word determination means 9 to determine whether the average value output from the multilingual word similarity calculation means 7 is equal to or greater than the second threshold (step S4). If the word determination means 9 determines that the average value is equal to or greater than the second threshold (Yes in step S4), the device determines the word to be a content word (step S5); if not (No in step S4), it determines the word to be a function word (step S6). The device then outputs the determination result (content word or function word) (step S7) and ends the operation.
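Steps S1 to S7 above can be sketched end to end as follows. This is a hedged illustration rather than the patented implementation: `classify_word`, `table_similarity`, and the one-word-per-language lists are hypothetical, the similarity function is passed in as a parameter because the SimMono formula is not reproduced here, and treating a word with no surviving similarities as having an average of 0.0 is an assumption.

```python
from typing import Callable, Dict, List


def classify_word(word: str,
                  word_lists: Dict[str, List[str]],
                  similarity: Callable[[str, str], float],
                  first_threshold: float = 0.5,
                  second_threshold: float = 0.6) -> str:
    """Classify a target-language word as a content word or a function word."""
    # S1-S2: for each language, keep the maximum similarity over its word list.
    max_sims = [
        max(similarity(word, candidate) for candidate in words)
        for words in word_lists.values()
        if words
    ]
    # S3: drop maxima below the first threshold, then average the rest.
    kept = [s for s in max_sims if s >= first_threshold]
    average = sum(kept) / len(kept) if kept else 0.0  # assumption for the empty case
    # S4-S7: compare against the second threshold and return the result.
    return "content word" if average >= second_threshold else "function word"


# Hypothetical lookup standing in for SimMono, filled with the values from (2)-(6).
TABLE = {
    ("ministro", "minister"): 0.82,
    ("ministro", "ministro"): 1.0,
    ("ministro", "ministre"): 0.88,
    ("ministro", "Токио"): 0.15,
}


def table_similarity(a: str, b: str) -> float:
    return TABLE.get((a, b), 0.0)


word_lists = {
    "English": ["minister"],
    "Portuguese": ["ministro"],
    "French": ["ministre"],
    "Russian": ["Токио"],
}

result = classify_word("ministro", word_lists, table_similarity)
```

Passing the similarity function as an argument keeps the control flow of FIG. 2 separate from the choice of character-level measure, so the thresholds and the measure can be varied independently.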

An embodiment of the present invention has been described above, but the present invention is not limited to this embodiment. For example, although the embodiment has been described as a multilingual word classification device, it can also be configured as a multilingual word classification program that combines function programs for executing the processing of each means.

FIG. 1 is a block diagram of the multilingual word classification device according to an embodiment of the present invention.
FIG. 2 is a flowchart showing the operation of the multilingual word classification device shown in FIG. 1.

Explanation of symbols

1 Multilingual word classification device
3 Word similarity calculation means
5 Word list group storage means
7 Multilingual word similarity calculation means
9 Word determination means

Claims

A multilingual word classification device that, when a plurality of languages for each of which a word list including a plurality of words is prepared and a target language in which content words and function words are not classified are composed of the same character string or include a predetermined number or more of the same characters, classifies words of the target language into content words and function words using the word lists, the device comprising:
word list group storage means for storing a word list group consisting of a plurality of the word lists;
monolingual word similarity calculation means for calculating, for a plurality of words included in each word list of the word list group stored in the word list group storage means, a monolingual word similarity indicating the degree of similarity between a word included in the word list and an input word of the target language, based on the number of characters and the order in which the characters appear in the characters constituting the word included in the word list and in the characters constituting the input word of the target language, and for outputting, for each language, the maximum monolingual word similarity, which is the largest of the monolingual word similarities;
multilingual word similarity calculation means for taking an average value of the plurality of maximum monolingual word similarities output by the monolingual word similarity calculation means, excluding any maximum monolingual word similarity that is less than a preset first threshold; and
word determination means for determining the word of the target language to be a content word when the average value calculated by the multilingual word similarity calculation means is equal to or greater than a preset second threshold, and for determining the word of the target language to be a function word when the average value is less than the second threshold.

A multilingual word classification program that, when a plurality of languages for each of which a word list including a plurality of words is prepared and a target language in which content words and function words are not classified are composed of the same character string or include a predetermined number or more of the same characters, causes a computer provided with word list group storage means for storing a word list group consisting of a plurality of the word lists to function, in order to classify words of the target language into content words and function words using the word lists, as:
monolingual word similarity calculation means for calculating, for a plurality of words included in each word list of the word list group stored in the word list group storage means, a monolingual word similarity indicating the degree of similarity between a word included in the word list and an input word of the target language, based on the number of characters and the order in which the characters appear in the characters constituting the word included in the word list and in the characters constituting the input word of the target language, and for outputting, for each language, the maximum monolingual word similarity, which is the largest of the monolingual word similarities;
multilingual word similarity calculation means for taking an average value of the plurality of maximum monolingual word similarities output by the monolingual word similarity calculation means, excluding any maximum monolingual word similarity that is less than a preset first threshold; and
word determination means for determining the word of the target language to be a content word when the average value calculated by the multilingual word similarity calculation means is equal to or greater than a preset second threshold, and for determining the word of the target language to be a function word when the average value is less than the second threshold.