JP5356197B2 - Word semantic relation extraction device

Info

Publication number: JP5356197B2
Application number: JP2009273560A
Authority: JP
Inventors: 康嗣森本; 真岩山
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2009-12-01
Filing date: 2009-12-01
Publication date: 2013-12-04
Anticipated expiration: 2029-12-01
Also published as: JP2011118526A

Abstract

PROBLEM TO BE SOLVED: To extract word semantic relationships from text data with high precision by making use of an existing dictionary such as a synonym dictionary.

SOLUTION: A word semantic relationship extraction device calculates several types of similarity for arbitrary pairs of words in a text, generates for each pair a feature vector whose elements are those similarities, assigns each pair a label indicating whether or not the two words are synonyms according to a synonym dictionary, learns a classifier from the feature vectors and labels, and uses the learned classifier to identify whether or not two words are synonyms.

COPYRIGHT: (C)2011, JPO&INPIT

Description

The present invention relates to a technique for extracting semantic relationships between words from text, and more particularly to a technique for extracting word semantic relationships such as synonyms, hypernyms and hyponyms (broader and narrower terms), sibling terms, and translation equivalents.

With the spread of personal computers and the Internet, the volume of electronic documents accessible to users keeps growing. Document search is one technique for efficiently finding a desired document in such a large body of document information. By finding documents that contain an input keyword, a document search technique lets a user obtain the desired document efficiently. In many cases, however, a simple character string search is not sufficient. One unsolved problem is that of synonyms: because several words can express the same meaning, a document that expresses that meaning may not be found by a simple character string search, and relevant documents are missed. To deal with this problem, search systems have conventionally been provided with a synonym dictionary.

Because creating a synonym dictionary by hand is costly, attempts have long been made to create synonym dictionaries automatically from text data. One method focuses on the context in which a word appears, that is, on the words and character strings that appear near the word of interest. Non-Patent Document 1 discloses a context-based synonym extraction technique that relies on such appearance contexts. There are also methods that specifically handle spelling variants among synonyms. Non-Patent Document 2 discloses a notation-based synonym extraction technique that detects variants in katakana spelling based on pronunciation rules. With the recent spread of the Web and of search engines for Web documents, word semantic relationship extraction techniques that use a search engine have also been proposed. In the search engine approach, the appearance contexts of words cannot be computed in advance. Instead, a co-occurrence frequency is obtained by submitting the words as an AND query, and synonyms are extracted using a statistic based on that frequency. Non-Patent Document 3 discloses such a search-engine-based, co-occurrence-based synonym extraction technique. There are also synonym extraction techniques that use patterns, such as "C such as A and B", that explicitly indicate that words are synonyms or stand in a hypernym/hyponym relationship. Non-Patent Document 4 discloses a pattern-based synonym extraction technique that uses such word patterns. Translation equivalence is another semantic relationship between words and can be regarded as an extension of the synonym relationship to multiple languages. Non-Patent Document 5 discloses a technique for automatically extracting translation equivalents; it extends context-based synonym extraction to multiple languages.

The synonym extraction techniques above rely on unsupervised learning, that is, learning that does not use manually assigned correct answers. Because no correct answers have to be prepared, unsupervised learning has the advantage of low manual cost. However, it has the following problems.

Large manually created dictionaries are now widely available. Existing synonym dictionaries, thesaurus dictionaries, and bilingual dictionaries are valuable resources that were built at high cost and should be used as effectively as possible. Word semantic relationship extraction based on unsupervised learning does not assume that such manually created dictionaries exist, and even when they do exist it cannot use them to improve accuracy.

As a way of solving these problems, Non-Patent Document 6 discloses a synonym extraction method based on supervised learning. In Non-Patent Document 6, a manually created synonym dictionary is used as the correct answers, and synonyms are extracted by supervised learning. Specifically, the meaning of a word is represented based on the context of the word (described later), learning is performed using the synonym dictionary as correct answers, and synonyms are extracted.

Non-Patent Document 1: Aizawa, "Study on word similarity calculation using large-scale text corpus," IPSJ Journal, vol. 49-3, pp. 1426-1436 (2008).
Non-Patent Document 2: Kubota et al., "A method for unifying katakana notation: detecting katakana notation variants by preliminary classification and graph comparison," IPSJ Natural Language Processing Study Group Report, NL97-16, pp. 111-117, 1993.
Non-Patent Document 3: P. Turney, "Mining the web for synonyms: PMI-IR versus LSA on TOEFL," ECML 2001, pp. 491-502, 2001.
Non-Patent Document 4: M. Hearst, "Automatic acquisition of hyponyms from large text corpora," Proceedings of the 14th International Conference on Computational Linguistics (COLING-92), pp. 539-545, 1992.
Non-Patent Document 5: Hiroyuki Kaji and Toshiko Aizono, "Extracting word correspondences from bilingual corpora based on word co-occurrence information," Proceedings of the 16th International Conference on Computational Linguistics, pp. 23-28, 1996.
Non-Patent Document 6: Masato Hagiwara, "A Supervised Learning Approach to Automatic Synonym Identification based on Distributional Features," Proc. of ACL 2008 Student Research Workshop, pp. 1-6, 2008.

An object of the present invention is to realize word semantic relationship extraction with higher accuracy than the prior art. The supervised learning approach solves the problems above, but it has problems of its own. The biggest is that the knowledge accumulated in earlier research on unsupervised learning is not utilized. For example, Non-Patent Document 6 uses every word that co-occurs with a word pair as a feature and tries to learn the similarity over the entire context distribution from the training data alone. Yet many proposals and refinements concerning the similarity of context distributions have already been made, such as those disclosed in Non-Patent Document 1. Supervised learning needs to be applied in a way that incorporates this knowledge.

Non-Patent Document 6 discloses a synonym extraction technique based on context-based similarity computed from syntactic analysis results, but it does not consider the many other approaches that have been studied for synonym extraction by unsupervised learning. Each of those earlier approaches has its own strengths and weaknesses. For example, the notation-based method disclosed in Non-Patent Document 2 can extract only specific types of synonyms, such as katakana spelling variants. The pattern-based method disclosed in Non-Patent Document 4 can extract any type of synonym with relatively high precision, but its coverage is low and it is difficult to extract all the necessary synonyms. Context-based similarity is all-purpose with respect to the types of synonyms it can extract and covers a wide range of synonyms, but its precision is lower than that of the notation-based and pattern-based methods. Integrating these methods is essential for improving accuracy.

The present invention has been made to solve the above problems. Its object is to provide a word semantic relationship extraction method that makes use of existing synonym dictionaries and thesaurus dictionaries while integrating a plurality of approaches, and that allows an appropriate threshold to be set.

The word semantic relationship extraction device of the present invention comprises: means for generating, for each set of words extracted from a text, a feature vector whose elements are a plurality of different types of similarity; means for referring to a known dictionary and assigning to each feature vector a label indicating a word semantic relationship; means for learning a word semantic relationship determination rule based on the plurality of labeled feature vectors; and means for determining the word semantic relationship of an arbitrary set of words based on the learned rule.

One example of a word semantic relationship is whether or not the two words of a set are synonyms. In this case, a synonym dictionary that stores headwords and their synonyms is used as the known dictionary.

Another example is whether the two words of a set are synonyms, stand in a hypernym/hyponym relationship, are sibling terms, or are none of these. In this case, a thesaurus dictionary that stores headwords together with their synonyms, hypernyms/hyponyms, or sibling terms is used as the known dictionary.

Yet another example is the translation relationship between the two words of a set. In this case, a bilingual dictionary that stores headwords and their translations is used as the known dictionary.

The word semantic relationship extraction device of the present invention can be realized by a computer system including a processor, a memory, and an interface.

The similarities of a set of words that form the elements of a feature vector can be obtained in various ways. One example extracts from the text pairs of a word (the target word) and a word that forms its context (a context word), aggregates the extracted pairs into a context matrix, and calculates a context-based similarity from that matrix. Another example calculates a character overlap degree based on how many characters the two words of a set in the text share, and calculates the similarity of the set from it. Alternatively, the similarity of the set may be calculated based on how similar the characters of the two words are. Yet another example extracts, for an arbitrary set of words in the text, a co-occurrence frequency indicating how often the words appear together, and calculates a co-occurrence similarity from it.
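As a rough illustration of two of the similarity measures named above, the following Python sketch computes a character overlap degree and a context-based similarity. The concrete formulas (a Dice-style overlap of character sets, and the cosine between two rows of a context matrix) and all function names are assumptions chosen for illustration; the description at this point does not fix the exact definitions.

```python
import math
from collections import Counter


def char_overlap(w1: str, w2: str) -> float:
    """Dice-style overlap of the character sets of two words (assumed formula)."""
    s1, s2 = set(w1), set(w2)
    if not s1 or not s2:
        return 0.0
    return 2 * len(s1 & s2) / (len(s1) + len(s2))


def context_cosine(ctx1: Counter, ctx2: Counter) -> float:
    """Cosine similarity between two rows of a context matrix.

    Each row maps a context word to the number of times it was
    extracted together with the target word.
    """
    dot = sum(ctx1[c] * ctx2[c] for c in ctx1.keys() & ctx2.keys())
    n1 = math.sqrt(sum(v * v for v in ctx1.values()))
    n2 = math.sqrt(sum(v * v for v in ctx2.values()))
    if n1 == 0 or n2 == 0:
        return 0.0
    return dot / (n1 * n2)
```

Each function returns one value per word pair, so each can serve as one element of the pair's feature vector.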

According to a representative embodiment of the present invention, additional information sources such as manually created synonym dictionaries, thesaurus dictionaries, and bilingual dictionaries are used as training data, and at the same time the different types of similarity obtained by a plurality of approaches are integrated. This makes it possible to extract word semantic relationships with higher accuracy than before.

A block diagram showing a configuration example of the computer system according to the present invention.
A diagram showing the relationships among the word semantic relationship extraction program, the dictionaries, and the various tables and files.
A sequence diagram showing the flow of processing in the computer system of the present invention.
An explanatory diagram of a similarity matrix.
A flowchart of the word semantic relationship extraction processing.
An explanatory diagram of a synonym dictionary.
An explanatory diagram of a thesaurus dictionary.
A conceptual explanatory diagram of synonym identification.
An explanatory diagram of a screen presented to the user.
An explanatory diagram of a context matrix.
An explanatory diagram of a context matrix.
A flowchart of the context extraction processing.
An explanatory diagram of a morphological analysis result.
An explanatory diagram of a context pattern.
A flowchart of the character overlap degree calculation processing.
A flowchart of the character similarity calculation processing.
An explanatory diagram of a character similarity table.
An explanatory diagram of a co-occurrence frequency table.
An explanatory diagram of a word frequency table.
An explanatory diagram of a co-occurrence similarity table.
An explanatory diagram of experimental results showing the effect of the word semantic relationship extraction device of the present invention.
An explanatory diagram of a similarity matrix.
An explanatory diagram of a screen presented to the user.
An explanatory diagram of a bilingual dictionary.
An explanatory diagram of a similarity matrix.
An explanatory diagram of a context matrix.
An explanatory diagram of a context matrix.

Embodiments of the present invention will be described below with reference to the drawings.
[First Embodiment]
As a first embodiment, a synonym extraction device that extracts word pairs standing in a synonym relationship, as one kind of word semantic relationship, will be described. FIG. 1 is a block diagram showing a configuration example of a computer system that implements the present invention. The computer system shown in FIG. 1 is used in the first embodiment and is also shared by the second and third embodiments. It therefore includes functions that are not used in some embodiments.

The word semantic relationship extraction device 100 includes a CPU 101, a main memory 102, an input/output device 103, and a disk device 110. The CPU 101 performs various kinds of processing by executing programs stored in the main memory 102. Specifically, the CPU 101 loads a program stored in the disk device 110 into the main memory 102 and executes it. The main memory 102 stores the programs executed by the CPU 101 and the information the CPU 101 requires. The input/output device 103 receives information from the user and outputs information in response to instructions from the CPU 101. For example, the input/output device 103 includes at least one of a keyboard, a mouse, and a display.

The disk device 110 stores various kinds of information. Specifically, it stores an OS 111, a word semantic relationship extraction program 112, a text 113, manually created dictionaries 114, a similarity matrix 115, a context matrix 116, a part-of-speech pattern 117, a co-occurrence similarity table 118, an identification model 119, and a character similarity table 120.

The OS 111 controls the overall processing of the word semantic relationship extraction device 100. The manually created dictionaries 114 are various dictionaries created by hand and include a synonym dictionary 1141, a thesaurus dictionary 1142, and a bilingual dictionary 1143. The synonym dictionary 1141 stores manually created synonyms. The thesaurus dictionary 1142 stores manually created synonyms and hypernyms/hyponyms.

The word semantic relationship extraction program 112 extracts word semantic relationships from the text 113 and the synonym dictionary 1141 or the thesaurus dictionary 1142. It consists of a feature vector extraction subprogram 1121, a correct label setting subprogram 1122, an identification model learning subprogram 1123, and an identification model application subprogram 1124.

The text 113 is the text input to the word semantic relationship extraction program 112 and need not be in any special format. For documents that contain tags, such as HTML or XML documents, it is desirable to remove the tags in a preprocessing step, but processing is possible even with the tags left in.
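The tag-removal preprocessing mentioned above could look like the following minimal Python sketch. It assumes HTML-like input and uses the standard-library `html.parser` module; the class and function names are hypothetical, and the description does not say how the tags are actually removed.

```python
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collects character data and discards the markup around it."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data: str) -> None:
        self.chunks.append(data)


def strip_tags(markup: str) -> str:
    """Return the text content of an HTML-like string with all tags removed."""
    parser = _TextOnly()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.chunks)
```

For example, `strip_tags("<p>同義語<b>辞書</b></p>")` yields the plain string `同義語辞書`, which can then be passed on to morphological analysis.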

The similarity matrix 115 stores, for each word pair extracted from the text and the synonym dictionary, a feature vector, a label indicating whether or not the pair is a synonym pair, and so on. The context matrix 116 stores the word context information needed to calculate the context-based similarity. The part-of-speech pattern 117 is data used to extract that context information from the text. The co-occurrence similarity table 118 stores co-occurrence-based similarities calculated from word co-occurrences. The identification model 119 is a model, learned from the similarity matrix, for identifying whether or not a word pair is a synonym pair. The character similarity table 120 stores relationships between characters with similar meanings.

FIG. 2A shows the relationships among the word semantic relationship extraction program, the dictionaries, and the various tables and files shown in FIG. 1. The feature vector extraction subprogram 1121 reads the text 113, extracts all the words in it, calculates the various similarities for arbitrary sets of words, and outputs them as the similarity matrix 115. The information needed for this, such as the context matrix 116 and the co-occurrence similarity table 118, is created beforehand. The first embodiment assumes that the text consists of documents in a single language, for example Japanese. If some English documents are mixed in, however, the only consequence is some wasted processing. The part-of-speech pattern 117 is used to create the context matrix 116. The correct label setting subprogram 1122 reads the synonym dictionary 1141, the thesaurus dictionary 1142, and the bilingual dictionary 1143 as correct-answer data and sets, for each word pair in the similarity matrix 115, a label indicating whether the pair is correct, that is, whether it is a synonym pair. The identification model learning subprogram 1123 reads the similarity matrix 115 and learns the identification model 119 for identifying whether or not a word pair is a synonym pair. The identification model application subprogram 1124 reads the identification model 119 and assigns to each word pair in the similarity matrix 115 a determination result indicating whether or not it is a synonym pair.

FIG. 2B is a sequence diagram showing the flow of processing in the computer system of the present invention. First, the OS is loaded from the disk device into the main memory and waits for user input. Processing starts when the user instructs the system to run the word semantic relationship extraction program. The feature vector extraction subprogram is loaded into the main memory; it reads the text, extracts all the words in it, and creates the context matrix using the part-of-speech pattern. Next, a character similarity table is created from the words obtained by morphological analysis and from the manually created dictionaries. A co-occurrence similarity table is then created from the morphological analysis results. Finally, a similarity matrix made up of the various similarities is created. The first embodiment assumes that the text consists of documents in a single language, for example Japanese. If some English documents are mixed in, however, the only consequence is some wasted processing.

The correct label setting subprogram reads the manually created dictionaries as correct-answer data and sets, for each word pair in the similarity matrix, a label indicating whether the pair is correct, that is, whether it is a synonym pair. The identification model learning subprogram reads the similarity matrix and learns an identification model for identifying whether or not a word pair is a synonym pair. The identification model application subprogram reads the identification model and assigns to each word pair in the similarity matrix a determination result indicating whether or not it is a synonym pair.

The basic idea of the present invention is described below using the example of the similarity matrix shown in FIG. 3.
Consider an arbitrary pair of words contained in the text data, for example <計算機, コンピュータ> (keisanki and konpyūta, both meaning "computer"). Various measures can be used to determine whether or not such a word pair is a synonym pair.

For example, one method uses the similarity between the contexts in which words appear (hereinafter, context-based similarity), as disclosed in Non-Patent Document 1. A similarity based on notation, for example one that looks at the number of shared characters (hereinafter, notation-based similarity), as disclosed in Non-Patent Document 2, is also conceivable. A similarity based on how often the words of a pair co-occur (hereinafter, co-occurrence-based similarity), as disclosed in Non-Patent Document 3, can be used as well. Each method in turn has many variations. Context-based similarity, for instance, varies with how the appearance context of a word is defined and with how the distance is calculated. For co-occurrence-based similarity, different statistics such as mutual information or the Dice coefficient can be used as the similarity calculated from the co-occurrence frequency. In the present invention, these various similarities are regarded as features of a word pair, and the word pair is represented by a feature vector made up of the value of each feature. In the example of FIG. 3, the word pair <コンピュータ, コンピューター> (two katakana spellings of "computer") is represented by a vector whose value is 0.3 in the dimension of feature 1, 0.2 in the dimension of feature 2, and 0.8 in the dimension of feature N.
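The two co-occurrence statistics named above can be sketched as follows. This is an illustration under stated assumptions: "mutual information" is taken here to mean pointwise mutual information, the counts are assumed to be per-unit frequencies (for example, the number of sentences containing each word), and the function and parameter names are hypothetical.

```python
import math


def dice(cooc: int, freq1: int, freq2: int) -> float:
    """Dice coefficient: 2 * f(w1, w2) / (f(w1) + f(w2))."""
    total = freq1 + freq2
    return 2 * cooc / total if total else 0.0


def pmi(cooc: int, freq1: int, freq2: int, n: int) -> float:
    """Pointwise mutual information: log2(P(w1, w2) / (P(w1) * P(w2))).

    n is the number of units (e.g. sentences) over which the
    frequencies were counted. Returns -inf when the pair never co-occurs.
    """
    if cooc == 0 or freq1 == 0 or freq2 == 0:
        return float("-inf")
    return math.log2(cooc * n / (freq1 * freq2))
```

Either value, or both, can occupy one dimension of the feature vector for a word pair, alongside context-based and notation-based similarities.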

Next, whether or not the word pair is a synonym pair is judged using a manually created dictionary such as a synonym dictionary or a thesaurus dictionary, and the pair is labeled accordingly. That is, if <計算機, コンピュータ> is contained in the synonym dictionary, it is labeled as correct. A row representing a correct answer, that is, such a word pair, is called a positive example. In the example of FIG. 3, <計算機, コンピュータ> and <コンピュータ, コンピューター> are synonyms and are therefore given the label "1", which denotes a correct answer. A word pair that is not contained in the synonym dictionary is labeled as incorrect. A row representing an incorrect answer is called a negative example. In the example of FIG. 3, <プログラム, コンピュータ> (puroguramu, "program", and "computer") is not a synonym pair and is therefore given the label "-1", which denotes an incorrect answer. Representing word pairs as vectors of feature values and attaching correct-answer data in this way makes it possible to apply a supervised classifier such as a support vector machine. This is the basic idea of the present invention.
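The labeling and learning steps can be sketched in Python as below. Several things here are assumptions made for illustration: the dictionary layout (headword mapped to a set of synonyms), the feature values of the first and third rows (only the values 0.3, 0.2, 0.8 come from the example in the text), and above all the classifier, where a plain perceptron is used as a dependency-free stand-in for the support vector machine mentioned in the text.

```python
# Hypothetical synonym dictionary: headword -> set of synonyms.
SYNONYMS = {
    "計算機": {"コンピュータ"},
    "コンピュータ": {"コンピューター"},
}


def label(w1, w2, synonyms=SYNONYMS):
    """Return 1 if the pair is listed in the dictionary (either direction), else -1."""
    listed = w2 in synonyms.get(w1, set()) or w1 in synonyms.get(w2, set())
    return 1 if listed else -1


def train_perceptron(rows, epochs=20):
    """Train a linear classifier on (feature_vector, label) rows.

    A perceptron is NOT the method named in the text; it only stands in
    for a support vector machine so that the sketch needs no libraries.
    """
    dim = len(rows[0][0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in rows:
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:  # misclassified: shift the boundary toward x
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
    return w, b


def predict(model, x):
    """Return 1 (synonym pair) or -1 (not a synonym pair) for a feature vector."""
    w, b = model
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1


# The second row uses the feature values given in the text;
# the other two rows are made-up values for illustration only.
pairs = [
    (("計算機", "コンピュータ"), [0.9, 0.1, 0.7]),
    (("コンピュータ", "コンピューター"), [0.3, 0.2, 0.8]),
    (("プログラム", "コンピュータ"), [0.2, 0.1, 0.1]),
]
rows = [(x, label(w1, w2)) for (w1, w2), x in pairs]
model = train_perceptron(rows)
```

The learned `model` can then be applied with `predict` to the feature vector of any word pair, including pairs that do not appear in the dictionary.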

Care is needed when labeling a word pair that is not contained in the manually created dictionary. Because manually created dictionaries are not complete, a pair may be a synonym pair even though it is not contained in the synonym dictionary. How to deal with this problem is described later.

FIG. 4 is a flowchart of the word semantic relation extraction processing executed by the synonym extraction device according to the first embodiment of the present invention.

In step 11, it is determined whether all word pairs have been processed. If so, the process proceeds to step 17; if an unprocessed word pair remains, it proceeds to step 12. In step 12, it is determined whether all types of features have been processed. If so, the process proceeds to step 16; if an unprocessed feature remains, it proceeds to step 13.

In step 13, the i-th word pair is acquired. Word pairs can be obtained, for example, by morphologically analyzing the text in advance to build a list of all words and then taking every combination of two words from that list. In step 14, the j-th feature is calculated for the acquired i-th word pair. The details of step 14 are described later. The process then proceeds to step 15, where the calculated feature value is stored in the similarity matrix. An example of the similarity matrix was described with reference to FIG. 3.
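The loop over word pairs and features in steps 11 to 15 can be sketched as follows. This is a minimal illustration only: the function name `build_similarity_matrix` and the two toy feature functions are hypothetical stand-ins for the similarities described later, not the features of the embodiment.

```python
from itertools import combinations

def build_similarity_matrix(words, feature_functions):
    """Return {(w1, w2): [feature values]} for every unordered word pair."""
    matrix = {}
    # Steps 11 and 13: visit each combination of two distinct words once.
    for pair in combinations(sorted(set(words)), 2):
        # Steps 12, 14, 15: compute every feature and store it in the row.
        matrix[pair] = [feature(*pair) for feature in feature_functions]
    return matrix

# Hypothetical toy features, used only to make the sketch runnable.
def same_first_char(a, b):
    return 1.0 if a[0] == b[0] else 0.0

def length_ratio(a, b):
    return min(len(a), len(b)) / max(len(a), len(b))

matrix = build_similarity_matrix(
    ["コンピュータ", "コンピューター", "計算機"],
    [same_first_char, length_ratio],
)
for pair, features in matrix.items():
    print(pair, features)
```

Each row of the resulting dictionary plays the role of one row of the similarity matrix in FIG. 3, before the label column is filled in.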

In step 16, labels are set in the similarity matrix by referring to a synonym dictionary or a thesaurus. Because the first embodiment assumes documents in a single language, a bilingual dictionary is not normally used. In technical documents, however, English words may appear in Japanese text, and a bilingual dictionary may be used to handle such cases.

FIG. 5 shows an example of the synonym dictionary, and FIG. 6 shows an example of the thesaurus. The synonym dictionary stores, for each synonymous word pair, one word in the headword column and the other in the synonym column. For ease of lookup, the data is stored redundantly: for the synonym pair <コンピュータ, コンピューター>, the dictionary holds both a row with 「コンピュータ」 as the headword and a row with 「コンピューター」 as the headword. As a result, every synonym pair can be retrieved by searching the headword column alone.

The thesaurus stores word pairs that are synonyms as well as word pairs in a broader/narrower term relation. One word is stored in the headword column, the other in the related-word column, and the type column holds the type of the related word relative to the headword. In the example of FIG. 6, for the broader/narrower pair <コンピュータ, 機器> ("computer", "equipment"), 「コンピュータ」 is the headword, 「機器」 is the related word, and the type records that 「機器」 is a broader term (a more abstract word) of 「コンピュータ」. The thesaurus also stores its data redundantly for ease of lookup: for the word pair <コンピューター, 機器>, it holds both a row with 「コンピューター」 as the headword and a row with 「機器」 as the headword. Note that when a pair is in a broader/narrower relation, the type of the reversed pair is reversed as well. For example, 「コンピュータ」 is a narrower term of 「機器」.
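The redundant storage with an inverted type for the reversed pair can be sketched as below. The type names (`"synonym"`, `"broader"`, `"narrower"`) and the function `build_thesaurus` are assumptions made for illustration; the actual column values of FIG. 6 are not specified here.

```python
# Type of the related word as seen from the headword, and its inverse
# when headword and related word are swapped (hypothetical type names).
INVERSE_TYPE = {
    "synonym": "synonym",
    "broader": "narrower",
    "narrower": "broader",
}

def build_thesaurus(relations):
    """Store each (headword, related word, type) row in both directions."""
    rows = set()
    for head, related, rel_type in relations:
        rows.add((head, related, rel_type))
        # The reversed row carries the inverse relation type.
        rows.add((related, head, INVERSE_TYPE[rel_type]))
    return rows

rows = build_thesaurus([("コンピュータ", "機器", "broader")])
for row in sorted(rows):
    print(row)
```

With both directions stored, a lookup by headword alone finds every relation that a word takes part in.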

When setting labels in the similarity matrix, a word pair that matches a row of the synonym dictionary, that is, a synonym pair, is given the correct-answer label "1". Otherwise, the pair is handled as follows. If no row of the synonym dictionary contains the pair, but each of its words appears in some other row of the dictionary, the pair is regarded as not synonymous and is given the incorrect-answer label "−1". If at least one word of the pair does not appear in the synonym dictionary, the pair is given the unknown label "0".

In the example of FIG. 3, <コンピュータ, コンピューター> and <計算機, コンピュータ> are synonym pairs and are given the label "1". The pair <プログラム, コンピュータ> is given the label "−1" on the assumption that 「プログラム」 and 「コンピュータ」 each appear in the synonym dictionary but no row contains both. The pair <計算機, 仮想化技術> ("computer", "virtualization technology") is given the label "0" on the assumption that 「仮想化技術」 does not appear in the synonym dictionary. When a thesaurus is used, the type column is consulted and the same processing is applied only to rows whose type is synonym.
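The three-way labeling rule can be sketched as follows. The dictionary contents are hypothetical, in particular the entry pairing 「プログラム」 with 「ソフトウェア」, which is included only so that 「プログラム」 appears somewhere in the dictionary, as the example of FIG. 3 assumes.

```python
def make_rows(pairs):
    """Store each synonym pair redundantly, in both directions."""
    rows = set()
    for a, b in pairs:
        rows.add((a, b))
        rows.add((b, a))
    return rows

def label_pair(pair, synonym_rows):
    """Return 1 (synonym), -1 (not synonym), or 0 (unknown)."""
    a, b = pair
    if (a, b) in synonym_rows:
        return 1
    headwords = {head for head, _ in synonym_rows}
    if a in headwords and b in headwords:
        # Both words are known to the dictionary, but never as a pair.
        return -1
    # At least one word is missing from the dictionary.
    return 0

synonym_rows = make_rows([
    ("コンピュータ", "コンピューター"),
    ("計算機", "コンピュータ"),
    ("プログラム", "ソフトウェア"),  # hypothetical entry
])

for pair in [("コンピュータ", "コンピューター"),
             ("プログラム", "コンピュータ"),
             ("計算機", "仮想化技術")]:
    print(pair, label_pair(pair, synonym_rows))
```

Only rows labeled 1 or −1 would then be used for training; rows labeled 0 are the ones the trained classifier is later asked to decide.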

Returning to FIG. 4, a classification model is learned in step 17. A binary classification model is trained using only those rows of the similarity matrix whose label is "correct" or "incorrect". Any model can be used as the classification model. For example, the support vector machine disclosed in C. J. C. Burges, "A Tutorial on Support Vector Machines for Pattern Recognition", Data Mining and Knowledge Discovery, vol. 2, pp. 121-168 (1998) can be used.

FIG. 7 is a conceptual diagram of synonym classification. The feature vector of each word pair corresponds to a point in the N-dimensional space spanned by features 1 to N and is drawn as a black square in FIG. 7. Learning the classification model means finding the boundary between the region that contains synonymous word pairs and the region that contains non-synonymous word pairs. Applying the classification model means taking an unknown point, that is, a word pair for which it is not known whether it is a synonym pair, and deciding whether it is a synonym pair according to the region in which it lies. A characteristic of the support vector machine is that it can learn a non-linear classification model, that is, it can use a boundary other than a straight line, a plane, or a hyperplane (the analogue of a plane in a space of four or more dimensions).

In step 18, word semantic relations are extracted from the values of the similarity matrix according to the model. For every word pair in the matrix, the feature vector is input to the trained classifier, which determines whether the pair is a synonym pair. The classifier's decision is stored in the decision-result column of the similarity matrix. In this way, word pairs whose label was "unknown", that is, "0", are judged to be synonyms or not. The result can also be used to check a manually created synonym dictionary for errors. For word pairs that already have a label other than "unknown", only those whose label and decision result differ are extracted and reviewed by hand, which makes checking the synonym dictionary efficient.
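The two uses of the decision results described above, judging unknown pairs and flagging dictionary entries for review, can be sketched as below. The row layout (`"pair"`, `"label"`, `"predicted"`) and the sample values are assumptions for illustration; the classifier that produces `"predicted"` is outside this sketch.

```python
def split_results(rows):
    """Separate newly judged pairs from labeled pairs needing manual review."""
    # Pairs that had no dictionary label: the prediction is the new information.
    newly_judged = [r for r in rows if r["label"] == 0]
    # Pairs whose dictionary label disagrees with the classifier's decision.
    to_review = [r for r in rows
                 if r["label"] != 0 and r["label"] != r["predicted"]]
    return newly_judged, to_review

# Hypothetical similarity-matrix rows after classification.
rows = [
    {"pair": ("コンピュータ", "コンピューター"), "label": 1, "predicted": 1},
    {"pair": ("計算機", "コンピュータ"), "label": 1, "predicted": -1},
    {"pair": ("プログラム", "コンピュータ"), "label": -1, "predicted": -1},
    {"pair": ("計算機", "仮想化技術"), "label": 0, "predicted": -1},
]

newly_judged, to_review = split_results(rows)
print([r["pair"] for r in newly_judged])
print([r["pair"] for r in to_review])
```

Restricting manual review to the disagreeing rows is what keeps the dictionary check small relative to the full matrix.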

FIG. 8 shows an example screen of the synonym dictionary editor. Word pairs that are labeled as synonyms but judged not to be synonyms are displayed in the upper part of the screen, and their labels are changed according to the result of the manual check. Likewise, word pairs that are labeled as non-synonyms but judged to be synonyms are displayed in the lower part of the screen, and their labels are changed according to the result of the manual check. Such an editor makes it possible to check the synonym dictionary. It is of course also possible to assume that the data in the synonym dictionary is correct and to review only the "unknown" word pairs.

Step 14 of FIG. 4 is now described in detail. In step 14, various similarities are calculated as the features that represent a word pair. Each type of similarity is described in turn.

(1) Context-based similarity
This section describes how the context-based similarity is calculated. The context of a word is the words, word sequences, and so on that lie in the "neighborhood" of the places where the word appears in the text. Different definitions of "neighborhood" yield different contexts. The example below uses the verb that follows the word and the adjective or adjectival noun (形容動詞) that immediately precedes it as appearance contexts, but other contexts can be substituted, added, or combined. There are also various formulas for calculating the similarity between contexts.

The context-based similarity is calculated from a context matrix. FIG. 9 shows an example. The context matrix consists of a headword column and a context-information column; for each word in the headword column, it stores context information made up of repeated pairs of a context word sequence and its frequency. The example of FIG. 9 uses the particle and predicate that follow the word of interest as the context. For instance, it shows that 「が起動する」 ("starts up") occurs 15 times and 「を接続する」 ("connect") occurs 4 times with 「コンピュータ」. From such a context matrix, the context information of the rows for any two words is obtained, and the similarity is calculated from the frequency vectors of their context word sequences. The methods used for document retrieval with the term vector model can be used for the context-based similarity, for example the method disclosed in Kita, Tsuda, and Shishibori, "Information Retrieval Algorithms" (情報検索アルゴリズム), Kyoritsu Shuppan (2002). In this embodiment, as one example, the similarity s is calculated by the following equation.

The parameters of the equation are described in terms of document retrieval. For synonym extraction, read "input document" as the input word for which synonyms are to be extracted, "target document" as a synonym candidate word, and "word in the input document" as a context word of the input word.
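Because the equation itself does not appear in this text, the following sketch uses plain cosine similarity between context frequency vectors as a stand-in. It is not Equation (1) of the embodiment, which additionally contains a document-length normalization constant; the second and third context rows are hypothetical data.

```python
import math

def cosine_similarity(context_a, context_b):
    """Cosine similarity between two {context string: frequency} mappings."""
    shared = set(context_a) & set(context_b)
    dot = sum(context_a[c] * context_b[c] for c in shared)
    norm_a = math.sqrt(sum(v * v for v in context_a.values()))
    norm_b = math.sqrt(sum(v * v for v in context_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

# The first row follows the FIG. 9 example in the text; the others are made up.
context_matrix = {
    "コンピュータ": {"が起動する": 15, "を接続する": 4},
    "計算機": {"が起動する": 3, "を接続する": 1},
    "プログラム": {"を実行する": 7},
}

print(cosine_similarity(context_matrix["コンピュータ"], context_matrix["計算機"]))
print(cosine_similarity(context_matrix["コンピュータ"], context_matrix["プログラム"]))
```

Words that share no context strings score 0, while words whose context frequencies are proportional score close to 1.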

There are many possible choices of which words to extract as context. For example, 「高速な」 ("fast") can be extracted as a context of 「コンピュータ」 from an expression such as 「高速なコンピュータ」 ("fast computer"), and 「平均値を」 ("the average value") can be extracted as a context of 「計算（する）」 ("calculate") from an expression such as 「平均値を計算（する）」 ("calculate the average value"). These different kinds of context may be handled together, or each may be treated as a separate feature. This embodiment describes an example in which two different types of context are treated as separate features. As a type of context different from that of FIG. 9, FIG. 10 shows an example of extracting the adjectives and adjectival nouns that appear before the word of interest.

The method of creating the context matrix, which is executed by the feature vector extraction subprogram 1121, is described below with reference to the flowchart of FIG. 11.

First, in step 1401, the text is read and morphologically analyzed. FIG. 12 shows an example of the morphological analysis result, which is the text divided into words with a part of speech attached to each word. The morphological analysis result is assumed to be held temporarily in memory, but it may instead be stored in a file or the like. The processing from step 1402 onward may also be carried out while the morphological analysis proceeds in units of sentences, paragraphs, files, and so on.

In step 1402, it is determined whether all words in the morphological analysis result have been processed. If so, the whole process ends. If an unprocessed word remains, the process proceeds to step 1403. The words can simply be processed in order, starting with the first word, then the second, and so on.

In step 1403, attention is placed on the i-th word, and the part-of-speech sequence of the neighboring words is matched against predetermined part-of-speech patterns. FIG. 13 shows examples of such patterns. Pattern 1 extracts the verb that follows the word of interest as its context; it specifies a part-of-speech sequence in which a noun is followed by a particle and then a verb. Pattern 2 extracts the adjective or adjectival noun that immediately precedes the word of interest as its context; it specifies a part-of-speech sequence in which an adjective or adjectival noun is followed by a noun. In the figure, (T) after a part of speech marks the word of interest (target), and (C) marks a context word (sequence).

If a pattern matches the morphological analysis result, the process proceeds to step 1404. Based on the match, the morphemes that matched the word of interest in the pattern and the morphemes that matched the context word (sequence) are extracted and stored in the context matrix. A separate context matrix is created for each pattern.

For the morphological analysis result of FIG. 12, pattern 1 extracts the word sequence 「コンピュータ」, 「を」, 「起動する」 when i is 1 and the word sequence 「ウインドウ」, 「が」, 「現れる」 when i is 6. Pattern 2 extracts the word sequence 「新しい」, 「ウインドウ」. Using the distinction between the word of interest and the context words in each pattern, 「を起動する」 ("start up") is extracted as a context of the word of interest 「コンピュータ」 ("computer"), 「が現れる」 ("appears") is extracted as a context of the word of interest 「ウインドウ」 ("window"), and 「新しい」 ("new") is likewise extracted as a context of 「ウインドウ」.
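The pattern matching of steps 1403 and 1404 can be sketched as below. The token list is a hypothetical reconstruction chosen so that 「コンピュータ」 is word 1 and 「ウインドウ」 is word 6, as in the text; the fourth token and the English part-of-speech tag names are assumptions. The sketch slides a window over every start position, which gives the same matches as focusing on each word in turn.

```python
from collections import Counter, defaultdict

# Each pattern element: (allowed parts of speech, role), where role "T" is
# the word of interest and "C" is a context word.
PATTERNS = {
    1: [({"noun"}, "T"), ({"particle"}, "C"), ({"verb"}, "C")],
    2: [({"adjective", "adjectival_noun"}, "C"), ({"noun"}, "T")],
}

def extract_contexts(tokens, patterns):
    """Return {pattern id: {target word: Counter of context strings}}."""
    matrices = {pid: defaultdict(Counter) for pid in patterns}
    for start in range(len(tokens)):
        for pid, pattern in patterns.items():
            window = tokens[start:start + len(pattern)]
            if len(window) < len(pattern):
                continue
            if all(pos in allowed
                   for (_, pos), (allowed, _) in zip(window, pattern)):
                target = "".join(s for (s, _), (_, role)
                                 in zip(window, pattern) if role == "T")
                context = "".join(s for (s, _), (_, role)
                                  in zip(window, pattern) if role == "C")
                matrices[pid][target][context] += 1
    return matrices

tokens = [
    ("コンピュータ", "noun"), ("を", "particle"), ("起動する", "verb"),
    ("と", "particle"),  # hypothetical filler token
    ("新しい", "adjective"), ("ウインドウ", "noun"),
    ("が", "particle"), ("現れる", "verb"),
]

matrices = extract_contexts(tokens, PATTERNS)
for pid, matrix in matrices.items():
    for target, contexts in matrix.items():
        print(pid, target, dict(contexts))
```

Keeping one matrix per pattern matches the statement that each pattern later yields its own similarity feature.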

The context matrix is created by the processing above. Because a context matrix is created for each pattern, the similarity obtained from each context matrix is a different feature. In addition, Equation (1) contains a constant for document-length normalization, and this constant cannot be determined automatically. The similarity is therefore calculated with several values of the constant chosen between 0 and 1. For example, suppose the similarity is calculated with four values of the constant, 0.1, 0.3, 0.5, and 0.7, and with the two context matrices that correspond to the two patterns shown in FIG. 13. This yields 4 × 2 = 8 features.

(2) Notation-based similarity
This section describes how the notation-based similarity is calculated. The notation-based similarity of a word pair is calculated from information about its characters. When the synonyms are spelling variants such as 「コンピュータ」 and 「コンピューター」, many characters overlap, so the proportion of overlapping characters can be used as a similarity, as disclosed in Non-Patent Document 2. Spelling variants are mostly katakana words, but kanji synonyms that are not spelling variants may also share characters, as in 「分析」 and 「解析」 (both "analysis") or 「信頼」 and 「信用」 (both "trust"). The similarity is therefore calculated from the degree of character overlap without restricting it to katakana words. Hereinafter, the similarity based on the proportion of overlapping characters is called the character overlap degree. For kanji words, and especially for short words such as two-character words, there are many words that share a character yet differ in meaning, such as 「分析」 ("analysis") and 「透析」 ("dialysis"). In the present invention, the character overlap degree becomes effective because it is combined with other kinds of similarity, such as the context-based similarity.

Furthermore, some kanji are different characters with similar meanings. For example, the characters in 「慕（う）」 ("to yearn for") and 「憧（れる）」 ("to long for") have similar meanings. If such character similarity can be learned from training data, the notation-based similarity between words can be calculated even when their characters do not match exactly. The word similarity based on character similarity is called the similar-character overlap degree.

(a) Character overlap degree
The character overlap degree can be calculated in various ways. As one example, this section describes a method that counts the characters the two words have in common and normalizes the count by the length of the shorter word. When the same character occurs more than once, m times in one word and n times in the other, the correspondence is m-to-n. In that case, the smaller of m and n is taken as the number of overlapping characters.

The method of calculating the notation-based similarity of two words, word i and word j, is described below with reference to FIG. 14.

In step 1411, it is checked whether all characters of word i have been processed. If so, the process proceeds to step 1415; if an unprocessed character remains, it proceeds to step 1412. In step 1412, it is checked whether all characters of word j have been processed. If so, the process returns to step 1411; if an unprocessed character remains, it proceeds to step 1413.

In step 1413, the m-th character of word i is compared with the n-th character of word j. If they match, the process proceeds to step 1414; otherwise it returns to step 1412. In step 1414, a flag is set on both the m-th character of word i and the n-th character of word j, and the process then returns to step 1412.

In step 1415, the flagged characters of word i and of word j are counted separately, and the smaller count is taken as the number of matching characters. For example, if 「ウインドウ」 and 「ウィンドー」 (two spellings of "window") are processed, the three characters 「ウ」, 「ン」, and 「ド」 match. Because 「ウ」 occurs twice in 「ウインドウ」, four characters are flagged in 「ウインドウ」 and three characters are flagged in 「ウィンドー」. The number of matching characters is therefore three.
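The flag-based procedure of steps 1411 to 1415, followed by normalization by the shorter word's length, can be sketched as below. The function name `char_overlap` and the handling of empty strings are assumptions added for the sketch.

```python
def char_overlap(word_i, word_j):
    """Character overlap degree, normalized by the shorter word's length."""
    if not word_i or not word_j:
        return 0.0  # assumption: empty input has no overlap
    flags_i = [False] * len(word_i)
    flags_j = [False] * len(word_j)
    # Steps 1411 to 1414: compare every character pair and flag matches.
    for m, char_i in enumerate(word_i):
        for n, char_j in enumerate(word_j):
            if char_i == char_j:
                flags_i[m] = True
                flags_j[n] = True
    # Step 1415: the smaller flagged count is the number of matching characters.
    matched = min(sum(flags_i), sum(flags_j))
    return matched / min(len(word_i), len(word_j))

print(char_overlap("ウインドウ", "ウィンドー"))
print(char_overlap("分析", "透析"))
```

For 「ウインドウ」 and 「ウィンドー」, four and three characters are flagged respectively, so three characters match out of a shorter length of five.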

Other variations are possible besides the method above: the overlap can be defined as the length of the common substring at the beginning of the two words or at the end of the two words, and the length used for normalization can be the average of the two word lengths or the longer of the two. As a more refined method, the two words can be aligned by, for example, DP matching, and the notation-based similarity calculated from the number of matched characters. A larger number of notation-based similarities can be calculated, depending on the available computing resources. The weight given to a character match can also be varied with the frequency of the character. In document retrieval, IDF (Inverse Document Frequency) is a well-known way of weighting words; in the same way, character weights can be calculated by treating a character that is contained in many words as less important.

(b) Similar-character overlap degree
Character similarity is learned from the synonym dictionary, and the character overlap degree is then calculated with similar characters taken into account. The method of calculating character similarity is described with reference to the flowchart shown in FIG. 15.

In step 1421, the synonymous word pairs are obtained from the synonym dictionary. Next, in step 1422, character pairs consisting of a character taken from one word of the pair and a character taken from the other word are obtained for all combinations. For example, if 「敬慕」 and 「憧憬」 (both roughly "adoration") form a synonym pair, the four character pairs 「敬」/「憧」, 「敬」/「憬」, 「慕」/「憧」, and 「慕」/「憬」 are obtained.

Next, in step 1423, the frequency of each character contained in the words of the synonym dictionary is calculated. Then, in step 1424, the character similarity is calculated for every character pair. The character similarity is the frequency of the character pair divided by the frequencies of the two characters that make up the pair (the Dice coefficient). Pointwise mutual information or the like may be used as the similarity instead.

In step 1425, the similarities calculated in step 1424 are normalized between pairs of the same character and pairs of different characters. Specifically, the average similarity AS over pairs of the same character and the average similarity AD over pairs of different characters are calculated. For pairs of the same character, the similarity is set to 1.0 regardless of the calculated value. For pairs of different characters, the value calculated in step 1424 multiplied by AD/AS is used as the final similarity. FIG. 16 shows an example of the character similarity table.
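Steps 1421 to 1425 can be sketched as below, under several assumptions: the pair frequency is divided by the sum of the two character frequencies, mirroring the co-occurrence formula used elsewhere in this description; each synonym pair is passed in once rather than in both directions; and the AD/AS scaling is skipped when either average cannot be computed. The second synonym pair is sample data taken from the examples in section (2).

```python
from collections import Counter
from itertools import product

def char_similarity_table(synonym_pairs):
    """Return {(char, char) sorted tuple: similarity} learned from synonyms."""
    pair_freq = Counter()
    words = set()
    for word_a, word_b in synonym_pairs:
        words.update((word_a, word_b))
        # Step 1422: every combination of one character from each word.
        for char_a, char_b in product(word_a, word_b):
            pair_freq[tuple(sorted((char_a, char_b)))] += 1
    # Step 1423: character frequencies over all words in the dictionary.
    char_freq = Counter(c for word in words for c in word)
    # Step 1424: pair frequency over the sum of character frequencies (assumed).
    raw = {pair: count / (char_freq[pair[0]] + char_freq[pair[1]])
           for pair, count in pair_freq.items()}
    # Step 1425: normalize using the averages AS (same) and AD (different).
    same = [v for (a, b), v in raw.items() if a == b]
    diff = [v for (a, b), v in raw.items() if a != b]
    scale = 1.0
    if same and diff:
        scale = (sum(diff) / len(diff)) / (sum(same) / len(same))
    return {pair: 1.0 if pair[0] == pair[1] else value * scale
            for pair, value in raw.items()}

def lookup(table, char_a, char_b):
    """Similarity of two characters; 0.0 if the pair was never observed."""
    return table.get(tuple(sorted((char_a, char_b))), 0.0)

table = char_similarity_table([("敬慕", "憧憬"), ("分析", "解析")])
print(lookup(table, "慕", "憧"), lookup(table, "析", "析"), lookup(table, "敬", "解"))
```

Sorting each character pair before using it as a key makes the table symmetric, so the order of the two characters in a lookup does not matter.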

The similar-character overlap degree can be calculated using the character similarity table. The calculation proceeds in the same way as for the character overlap degree. The difference is that, whereas the character overlap degree adds 1 to the count whenever one character matches, the similar-character overlap degree looks up the character similarity table and, if the characters are similar, adds their character similarity. When the characters are identical, the table holds 1.0, so the result is the same as for the character overlap degree.

(3) Co-occurrence-based similarity
The co-occurrence-based similarity indicates how likely two words are to appear together in a text. Synonyms are generally said to be unlikely to appear together. For example, for spelling variants such as 「コンピュータ」 and 「コンピューター」, using only one form is recommended, and the two spellings rarely appear in the same document. Abbreviations, however, such as 「欧州連合」 ("European Union") and 「ＥＵ」, are often used together in the same text. The co-occurrence frequency can therefore serve as a clue for extracting synonyms.

In the morphological analysis result, attention is placed on the i-th word, and every co-occurrence between the word of interest and a word that appears within a predetermined N words of it is extracted and stored in the co-occurrence frequency table. FIG. 17 shows an example of the co-occurrence frequency table. At the same time, the frequency of each word that appears is calculated and stored in the word frequency table. FIG. 18 shows an example of the word frequency table. From the values of the word frequency table and the co-occurrence frequency table, the Dice coefficient, for example, is calculated as the co-occurrence-based similarity. With f(A) and f(B) denoting the frequencies of words A and B and F(A, B) denoting their co-occurrence frequency, the Dice coefficient is calculated as F(A, B) / (f(A) + f(B)). Other measures, such as pointwise mutual information, can also be used, and several measures may be used together. FIG. 19 shows an example of the co-occurrence similarity table 118.
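The co-occurrence counting and the formula F(A, B) / (f(A) + f(B)) can be sketched as below. The sketch counts each co-occurrence once by looking only at the words that follow the word of interest, which is an assumption about how the window is applied, and it uses the formula exactly as written in this description, without the factor of 2 found in the usual definition of the Dice coefficient. The token sequence is hypothetical.

```python
from collections import Counter

def cooccurrence_similarity(tokens, window):
    """Return {frozenset({A, B}): F(A, B) / (f(A) + f(B))}."""
    word_freq = Counter(tokens)   # word frequency table
    pair_freq = Counter()         # co-occurrence frequency table
    for i, word in enumerate(tokens):
        # Look at the next `window` words so each co-occurrence counts once.
        for other in tokens[i + 1:i + 1 + window]:
            if other != word:
                pair_freq[frozenset((word, other))] += 1
    return {pair: count / sum(word_freq[w] for w in pair)
            for pair, count in pair_freq.items()}

# Hypothetical token sequence for illustration.
tokens = ["欧州連合", "EU", "は", "EU", "加盟国"]
similarity = cooccurrence_similarity(tokens, window=2)
for pair, value in similarity.items():
    print(sorted(pair), round(value, 3))
```

Using an unordered `frozenset` as the key treats <A, B> and <B, A> as the same word pair, as in the similarity matrix.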

The processing above makes it possible to extract synonyms with higher accuracy than the prior art. FIG. 20 shows an example of the results, comparing the conventional methods (unsupervised learning with context-based similarity, and supervised learning with context words) with the method of the present invention. About 10 GB of Japanese text collected from the Web was used. The evaluation metric, average precision, is a measure commonly used to evaluate document retrieval accuracy; it combines precision (a measure of how little noise there is) and recall (a measure of how few omissions there are) into a single judgment. Precision and recall normally trade off against each other: changing a parameter of a given method improves one and worsens the other. For example, in a synonym extraction method, increasing the number of synonym candidates extracted improves recall (fewer omissions) but worsens precision (more noise). Comparing methods on precision alone is therefore meaningless. Average precision takes the precision at each recall level as recall is varied over 10%, 20%, 30%, and so on, and averages these values, which allows methods to be compared accurately.
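The averaging of precision over recall levels can be sketched as below. The sketch assumes the common interpolated convention, in which the precision at a recall level is the best precision achieved at that recall or higher, and assumes ten levels from 10% to 100%; the exact convention used for FIG. 20 is not stated in the text.

```python
def average_precision_at_recall_levels(ranked_relevance, levels=None):
    """ranked_relevance: list of booleans, best-ranked candidate first."""
    if levels is None:
        levels = [k / 10 for k in range(1, 11)]  # 10%, 20%, ..., 100%
    total_relevant = sum(ranked_relevance)
    if total_relevant == 0:
        return 0.0
    # (recall, precision) measured at each rank where a correct pair appears.
    points = []
    hits = 0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            points.append((hits / total_relevant, hits / rank))
    precisions = []
    for level in levels:
        # Interpolated precision: best precision at this recall or higher.
        candidates = [p for r, p in points if r >= level - 1e-12]
        precisions.append(max(candidates) if candidates else 0.0)
    return sum(precisions) / len(precisions)

# Hypothetical ranked output: correct pairs at ranks 1 and 3.
print(average_precision_at_recall_levels([True, False, True, False]))
```

A method that ranks all correct pairs first scores 1.0, and the score drops as incorrect pairs are ranked above correct ones.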

#1 corresponds to the method disclosed in Non-Patent Document 1, and #2 corresponds to the method disclosed in Non-Patent Document 6. It can be seen that the supervised methods, including the conventional method #2, are superior to the unsupervised method #1. Among the supervised methods, the proposed method #3, which uses similarities as features, is more accurate than the conventional method #2, which uses context words as features. It can also be seen that accuracy improves further when different features such as the character overlap degree (#4) and the similar-character overlap degree (#5) are used in combination with the context-based similarity used in #3.

In the above description, a method was described in which, in the processing of step 16 in FIG. 4, word pairs not included in the synonym dictionary are used as negative examples. This method is intended to avoid the problem that a word pair cannot necessarily be said to be non-synonymous merely because it is not included in the synonym dictionary. As another method, this problem can also be avoided by using a one-class SVM as the classifier. The one-class SVM is a technique that can learn a classifier from positive examples only. It is disclosed in Hideki Aso, Koji Tsuda, and Noboru Murata, "Statistics of Pattern Recognition and Learning: New Concepts and Methods" (Frontiers of Statistical Science), Iwanami Shoten (2003), and its description is therefore omitted here. When a one-class SVM is used, in the processing of step 16 in FIG. 4, only the rows labeled "correct" are used as training data, and learning is performed with the one-class SVM as the classifier. This makes it possible to construct a classifier solely from positive examples, that is, from information on word pairs included in the synonym dictionary.

Thus, the synonym extraction device according to the first embodiment of the present invention outputs a synonym dictionary that includes synonyms not contained in the existing synonym dictionary.

[Second Embodiment]
A thesaurus extraction device according to a second embodiment of the present invention is described below with reference to the drawings. In the first embodiment, word semantic relationship extraction is solved as a problem of classifying whether or not a pair of words are synonyms. In actual word semantic relationship extraction, however, more ambiguous situations exist. For example, a hypernym and its hyponym are not synonyms in the strict sense, but their meanings are similar; "company" and "manufacturer" are such a pair. The same applies to sibling terms, that is, words that share a common hypernym; "securities company" and "bank" are such a pair.

The second embodiment realizes a word semantic relationship extraction device that can handle such situations appropriately. In the second embodiment, word semantic relationship extraction is treated not as a binary classification problem but as a ranking problem. That is, synonyms are very similar and are given rank 1; hypernym/hyponym pairs and sibling terms are less similar than synonyms but still fairly similar and are given rank 2; and pairs that are neither have low similarity and are given rank 3. As in the first embodiment, word semantic relationships are then extracted by learning a ranking function from training data in which the ranks have been assigned as correct answers using a manually created dictionary.

In the second embodiment, steps 16, 17, and 18 in FIG. 4 of the first embodiment are changed as follows.

First, the change to step 16 is described. In the first embodiment, the synonym dictionary is referred to and a binary label is set: "+1" for a positive example and "-1" for a negative example. Unknown word pairs are excluded here. In the second embodiment, labels are set by referring to a thesaurus dictionary that includes the hypernym/hyponym relationships of words. The thesaurus dictionary is referred to, and if the pair of words are synonyms, the label "1" is assigned. If the pair of words are a hypernym and hyponym, or are sibling terms, the label "2" is assigned. The remaining cases are handled in the same way as in the first embodiment. That is, if the pair of words is not included in the thesaurus dictionary but each of the words is individually included in it, the label "3" is assigned as the incorrect-answer label. If either word of the pair is not included in the synonym dictionary, the "unknown" label (represented as -1) is assigned.
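The labeling rule for step 16 in the second embodiment can be sketched as follows. The data layout is an assumption made for illustration: the thesaurus is represented as two sets of unordered pairs (synonyms, and hypernym/hyponym or sibling pairs) plus the set of all words that appear in the dictionary. The function and constant names are hypothetical.

```python
SYNONYM, RELATED, UNRELATED, UNKNOWN = 1, 2, 3, -1

def assign_rank_label(a, b, synonym_pairs, related_pairs, vocabulary):
    """Return the rank label for the word pair (a, b).

    synonym_pairs / related_pairs: sets of frozenset({w1, w2}) taken from the thesaurus.
    vocabulary: set of every word that appears anywhere in the dictionary.
    """
    pair = frozenset((a, b))
    if pair in synonym_pairs:
        return SYNONYM        # rank 1: synonyms
    if pair in related_pairs:
        return RELATED        # rank 2: hypernym/hyponym or sibling terms
    if a in vocabulary and b in vocabulary:
        return UNRELATED      # rank 3: both words known, pair not listed
    return UNKNOWN            # -1: at least one word is missing from the dictionary
```

Rows labeled `UNKNOWN` carry no supervision and would be left out when the ranking function is trained.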

FIG. 21 shows an example of the similarity matrix in the second embodiment. It differs from the first embodiment in that the label column holds a rank: 1 for synonyms such as <コンピュータ, コンピューター> (two spellings of "computer"), 2 for hypernym/hyponym pairs such as <マシン, コンピュータ> ("machine", "computer"), and 3 for pairs that are none of the above, such as <計算機, 仮想化技術> ("computing machine", "virtualization technology").

Step 17 is changed so that ranking learning is performed instead of learning a binary classification model. A learner for ranking is disclosed in, for example, T. Joachims, "Training Linear SVMs in Linear Time," Proceedings of the ACM Conference on Knowledge Discovery and Data Mining (KDD), 2006, and its description is therefore omitted.
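One common way to turn ranked training data into something a linear learner can use is the pairwise transform: for every two examples with different ranks, the difference of their similarity vectors becomes a training example stating that the better-ranked pair should score higher. The sketch below shows only this data preparation step. It is an illustrative assumption, not the formulation of the cited reference or of the device, and the function name is hypothetical.

```python
def pairwise_preferences(vectors, ranks):
    """Build difference vectors for every two examples whose ranks differ.

    vectors: list of similarity (feature) vectors, one per word pair.
    ranks:   list of rank labels (1 is best); rows labeled -1 (unknown) are skipped.
    Returns a list of difference vectors x_i - x_j, each meaning
    "example i should be scored above example j".
    """
    labeled = [(v, r) for v, r in zip(vectors, ranks) if r > 0]
    preferences = []
    for vi, ri in labeled:
        for vj, rj in labeled:
            if ri < rj:  # i has the better (smaller) rank
                preferences.append([x - y for x, y in zip(vi, vj)])
    return preferences
```

A linear scoring function w is then sought such that w · d > 0 for as many difference vectors d as possible, and new word pairs are ranked by w · x.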

In step 18, the difference is that the value set is not binary but a value indicating the rank determined according to the learned model. Because the value is not binary, the dictionary editor screen also differs. For example, corrections can be made using a display such as that in FIG. 22. In the example of FIG. 22, word pairs for which the rank assigned as the label and the rank in the determination result differ by a certain threshold or more are displayed, and a check mark is initially placed on the item corresponding to the rank assigned as the label (in FIG. 22, one of "synonym", "hypernym/hyponym", or "other"). If the user judges an entry to be wrong, the user moves the check mark, and the dictionary is corrected by reflecting only the changed entries in it.
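The selection of word pairs to show in the dictionary editor can be sketched as follows. This reads the display condition as "the dictionary-derived rank and the model's rank differ by at least a threshold", which is an interpretation of the text. The tuple layout and the function name are assumptions, and rows with the unknown label (-1) are skipped because they have no dictionary rank to compare against.

```python
def pairs_to_review(rows, threshold):
    """Return rows whose label rank and predicted rank disagree strongly.

    rows: iterable of (word_a, word_b, label_rank, predicted_rank).
    threshold: minimum absolute rank difference that triggers a review.
    """
    return [
        row for row in rows
        if row[2] != -1 and abs(row[2] - row[3]) >= threshold
    ]
```

Each returned row would be displayed with its dictionary rank pre-checked, so the user only has to change the entries that are actually wrong.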

Thus, the thesaurus extraction device according to the second embodiment of the present invention outputs a thesaurus dictionary that includes synonyms, hypernyms/hyponyms, and sibling terms not contained in the existing thesaurus dictionary.

[Third Embodiment]
A translation relationship extraction device according to a third embodiment of the present invention is described below with reference to the drawings. In the third embodiment, translation relationships between different languages are extracted as the word relationship. A translation relationship can be regarded as a synonym relationship extended to words of different languages. Translation relationships can therefore be extracted using the same approach as in the first embodiment. The third embodiment uses the same system configuration as the first embodiment, except that a bilingual dictionary is used in place of the synonym dictionary. FIG. 23 shows an example of the bilingual dictionary 1143. The bilingual dictionary has exactly the same format as the synonym dictionary, with translations stored in place of synonyms.

FIG. 24 shows an example of the similarity matrix for translation extraction. Whereas each word pair in the example of FIG. 3 consists of two words of the same language, the example of FIG. 24 stores word pairs each consisting of a word of the first language and a word of the second language.

The overall processing flow is the same as in the flowchart of FIG. 4. However, the details of the processing in steps 13 and 14 differ slightly.

In step 13, the way word pairs are obtained differs. In the first embodiment, arbitrary pairs of different words are extracted from all the words of a single language to form word pairs, whereas in this embodiment word pairs are obtained by combining a word of the first language with a word of the second language. Specifically, an arbitrary word is taken from the list of words obtained by morphological analysis of the first-language text and another from the list of words obtained by morphological analysis of the second-language text, and the two form a word pair.
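The pairing in step 13 amounts to a Cartesian product of the two word lists. A minimal sketch follows; the function name is hypothetical, and removing duplicate words before pairing is an assumption.

```python
from itertools import product

def cross_language_pairs(first_language_words, second_language_words):
    """Pair every distinct first-language word with every distinct second-language word."""
    first = sorted(set(first_language_words))
    second = sorted(set(second_language_words))
    return list(product(first, second))
```

With V1 and V2 distinct words in the two lists this yields V1 × V2 candidate pairs, so in practice some frequency cutoff on the word lists would likely be needed; that cutoff is not described in the text.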

In step 14, the method of calculating the similarity of a word pair differs. The similarity calculation methods for translation extraction are described in detail below.

(1) Multilingual context-based similarity
In translation extraction, the two words that make up a word pair belong to different languages. The following description assumes that one is Japanese and the other is English. The contexts of the two words are therefore also in different languages, so the similarity cannot be calculated by matching context word strings directly. In this case, by using a bilingual dictionary to associate the words in the contexts with each other, a context-based similarity can be calculated in the same way as in synonym extraction.

FIGS. 25 and 26 show examples of context matrices in translation extraction. FIG. 25 is an example of a context matrix extracted from Japanese text, and FIG. 26 is an example of a context matrix extracted from English text. The difference from synonym extraction is that, in FIG. 25, only verbs are extracted as context, without particles. The reasons are that English has no particles and that, because the correspondence is made through the bilingual dictionary, character strings that include particles are normally not found in the dictionary. The absence of particles can, however, be addressed by performing case analysis with a technique such as syntactic parsing to identify the nominative case, objective case, and so on, and using the resulting cases in place of particles.

By preparing a context matrix for each language and using the bilingual dictionary to associate the context information of the two languages, a context-based similarity can be calculated as in the first embodiment. For example, the bilingual dictionary shows that 「起動する」 corresponds to "boot", 「停止する」 corresponds to "shutdown", and so on, so the similarity can be calculated from the context information of 「コンピュータ」 and "computer".
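The dictionary-based alignment of contexts can be sketched as follows. This is an illustration under stated assumptions: each word's context is a mapping from context word to count, the bilingual dictionary maps a Japanese word to a list of English translations, and cosine similarity stands in for whatever context similarity measure the first embodiment uses. The function names are hypothetical.

```python
import math
from collections import Counter

def translate_context(context, bilingual):
    """Map a Japanese context vector into English using the bilingual dictionary.

    Context words with no dictionary entry are dropped.
    """
    translated = Counter()
    for word, count in context.items():
        for target in bilingual.get(word, []):
            translated[target] += count
    return translated

def cosine(u, v):
    """Cosine similarity between two sparse count vectors (dict-like)."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def multilingual_context_similarity(ja_context, en_context, bilingual):
    """Context-based similarity between a Japanese word and an English word."""
    return cosine(translate_context(ja_context, bilingual), en_context)
```

For the example in the text, the context of 「コンピュータ」 would be translated through entries such as 「起動する」 → "boot" before being compared with the context of "computer".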

(2) Multilingual notation-based similarity
For loanwords written in katakana, techniques for estimating translation relationships from pronunciation are known. This kind of technique is called transliteration and is disclosed in, for example, K. Knight and J. Graehl, "Machine Transliteration," Computational Linguistics, 24(4), pp. 599-612, 1998. As a simple method, information is prepared in advance stating that "co" can be read as 「コ」 (ko), "m" as 「ン」 (n) or 「ム」 (mu), and "pu" as 「プ」 (pu) or 「ピュ」 (pyu); reading candidates such as 「コムプタ」, 「コンプタ」, and 「コンピュタ」 are generated from "computer"; and the similarity is calculated by comparing each reading candidate with the character string of the Japanese word using the method described in the first embodiment.

(3) Multilingual co-occurrence-based similarity
In translation extraction, as with context-based similarity, whether a Japanese word and an English word co-occur cannot be determined from the text alone. The co-occurrence-based similarity is therefore calculated using the bilingual dictionary. Specifically, co-occurrence-based similarities are calculated separately from the Japanese text and from the English text, and a co-occurrence similarity table is created for each. When a candidate translation word pair is given, one word of the pair is converted using the bilingual dictionary and looked up in the co-occurrence similarity table. Specifically, the Japanese word of the pair is converted into English using the bilingual dictionary and looked up in the English co-occurrence similarity table to obtain a similarity. If there are multiple candidates, all of them are obtained. Similarly, the English word of the pair is converted into Japanese using the bilingual dictionary and looked up in the Japanese co-occurrence similarity table to obtain a similarity. Through the above processing, a multilingual co-occurrence-based similarity can be calculated.

A plurality of similarities are obtained by the above processing, and several variations are conceivable, such as using all of the similarities, or using two values: the average of the similarities obtained by converting the Japanese word into English and the average of the similarities obtained by converting the English word into Japanese. Which approach is suitable depends on the size of the bilingual dictionary and the size of the text, so an appropriate method should be chosen according to the data to which it is applied.
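The two-average variation described above can be sketched as follows. The table layout (unordered word pair mapped to similarity), the dictionary layout (word mapped to a list of translations), the function names, and the choice to return 0.0 when no lookup succeeds are all assumptions made for illustration.

```python
def _mean(values):
    return sum(values) / len(values) if values else 0.0

def crosslingual_cooccurrence(ja_word, en_word, ja_to_en, en_to_ja, ja_table, en_table):
    """Return (average via Japanese-to-English conversion, average via English-to-Japanese).

    ja_table / en_table: dict mapping frozenset({w1, w2}) to a co-occurrence
    similarity computed within one language.
    """
    # Convert the Japanese word to English and look up each candidate with the English word.
    via_en = [
        en_table[key]
        for t in ja_to_en.get(ja_word, [])
        if (key := frozenset((t, en_word))) in en_table
    ]
    # Convert the English word to Japanese and look up each candidate with the Japanese word.
    via_ja = [
        ja_table[key]
        for t in en_to_ja.get(en_word, [])
        if (key := frozenset((ja_word, t))) in ja_table
    ]
    return _mean(via_en), _mean(via_ja)
```

The two returned values can be used as two separate elements of the feature vector, leaving it to the learned model to weigh them.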

Thus, the translation relationship extraction device according to the third embodiment of the present invention outputs a bilingual dictionary that includes words in translation relationships not contained in the existing bilingual dictionary.

100 Word semantic relationship extraction device
101 CPU
102 Main memory
103 Input/output device
110 Disk device
111 OS
112 Word semantic relationship extraction program
1121 Feature vector extraction subprogram
1122 Correct label setting subprogram
1123 Classification model learning subprogram
1124 Classification model application subprogram
113 Text
114 Manually created dictionary
1141 Synonym dictionary
1142 Thesaurus dictionary
1143 Bilingual dictionary
115 Similarity matrix
116 Context matrix
117 Part-of-speech pattern
118 Co-occurrence similarity table
119 Classification model
120 Character similarity table

Claims

A word semantic relationship extraction device comprising:
means for generating, for a set of words extracted from text, a feature vector having a plurality of different similarities as elements;
means for referring to a known dictionary and assigning a label indicating a word semantic relationship to the feature vector;
means for learning a word semantic relationship determination rule based on a plurality of feature vectors to which the label has been assigned; and
means for determining a word semantic relationship for an arbitrary set of words based on the learned word semantic relationship determination rule.

The word semantic relationship extraction device according to claim 1, wherein
the means for generating the feature vector comprises:
means for extracting words in the vicinity of the locations at which a word of interest appears in the text as context information of the word of interest; and
means for calculating the similarity between the context information of the two words of the set of words as a similarity of the set of words.

The word semantic relationship extraction device according to claim 1, wherein
the means for generating the feature vector comprises:
means for calculating a correspondence between the characters contained in the two words of the set of words based on whether or not they are the same character; and
means for calculating a similarity of the set of words based on the correspondence between the characters.

The word semantic relationship extraction device according to claim 1, wherein
the means for generating the feature vector comprises:
means for determining the similarity of the characters contained in the two words of the set of words; and
means for calculating a similarity of the set of words based on the similarity of the characters.

The word semantic relationship extraction device according to claim 1, wherein
the means for generating the feature vector comprises:
means for extracting, from the text, two words appearing within a certain distance of each other as a set of co-occurring words; and
means for calculating, as a similarity of the set of words, a statistic indicating the degree of co-occurrence of the words using the frequency of the set of co-occurring words.

The word semantic relationship extraction device according to claim 1, wherein
the word semantic relationship is a relationship as to whether or not the two words of the set of words are synonyms, and
the known dictionary is a synonym dictionary storing headwords and their synonyms.

The word semantic relationship extraction device according to claim 1, wherein
the word semantic relationship is a relationship as to whether the two words of the set of words are synonyms, are in a hypernym/hyponym relationship, are in a sibling relationship, or are none of these, and
the known dictionary is a thesaurus dictionary storing headwords and their synonyms, hypernyms/hyponyms, or sibling terms.

The word semantic relationship extraction device according to claim 1, wherein
the word semantic relationship is a translation relationship between the two words of the set of words, and
the known dictionary is a bilingual dictionary storing headwords and their translations.

The word semantic relationship extraction device according to any one of claims 1 to 8, further comprising:
means for determining a label that is highly likely to be incorrect based on information on the assigned label and the determined word semantic relationship;
means for displaying information on the label that is highly likely to be incorrect; and
means for accepting user input and correcting the incorrect label.