JP2013020431A

JP2013020431A - Polysemic word extraction system, polysemic word extraction method and program

Info

Publication number: JP2013020431A
Application number: JP2011152983A
Authority: JP
Inventors: Eiji Hirao (平尾 英司); Takeshi Furuhashi (古橋 武); Ohiro Yoshikawa (吉川 大弘)
Original assignee: Nagoya University NUC; NEC Corp
Current assignee: Nagoya University NUC; NEC Corp
Priority date: 2011-07-11
Filing date: 2011-07-11
Publication date: 2013-01-31
Anticipated expiration: 2031-07-11
Also published as: JP5754018B2

Abstract

PROBLEM TO BE SOLVED: To provide a polysemic word extraction system that removes ambiguity from text by identifying polysemic words that are used, within the document group of a specific project such as proposals or specifications for information system construction, in a meaning different from their general meaning.
SOLUTION: The polysemic word extraction system includes: a word analyzing section that extracts words from a given input text; a key word co-occurrence vector extraction section that selects a desired word as a key word and extracts a key word co-occurrence vector represented by the key word co-occurrence words that are in a co-occurrence relation with the key word and their co-occurrence counts; a co-occurrence word concept estimation section that estimates, on the basis of general concepts, the co-occurrence word concept of each key word co-occurrence word in the key word co-occurrence vector; a co-occurrence word classification section that clusters the key word co-occurrence words of the selected key word based on the similarity among the corresponding co-occurrence word concepts in the estimated co-occurrence word concept group; a polysemic word candidate estimation section that extracts the key word as a polysemic word candidate when plural clusters are found; and a polysemic word candidate output section that outputs the extracted candidate.

Description

The present invention relates to a polysemic word extraction system, a polysemic word extraction method, and a program, and in particular to a system, method, and program for extracting polysemic words that are assigned a plurality of meanings within documents relating to a given project, such as proposals or specifications for information system construction.

In recent years, systems have been developed that use an information processing apparatus to analyze documents written in natural language and automatically extract their meaning and significance. In such systems, the handling of polysemic words in the text can become a problem.

An example of a technique related to polysemic word extraction is described in Patent Document 1 as a "word thesaurus construction system". The word thesaurus construction system disclosed in Patent Document 1 comprises a sentence analysis unit, an inter-noun distance calculation unit, a noun clustering unit, an ambiguity resolution unit, a re-clustering unit, a thesaurus generation unit, and a data storage unit. A word thesaurus construction system with this configuration operates as follows.
The sentence analysis unit performs morphological and syntactic analysis on the sentences in the corpus under analysis, generates basic verb case-relation data, and generates a noun list, a verb list, and co-occurrence relation data. The inter-noun distance calculation unit obtains inter-noun distances from the generated lists and the co-occurrence relation data. The noun clustering unit generates noun clusters from the calculated inter-noun distances. The ambiguity resolution unit resolves the ambiguity of the verbs that co-occur with each noun on the basis of the tree structure of the noun clusters, and corrects the verb list and the co-occurrence relation data. The re-clustering unit performs noun clustering again using the corrected verb list and co-occurrence relation data. The thesaurus generation unit generates a word thesaurus from the re-clustering result. The data storage unit stores the corpus, which is the large body of text to be analyzed; the basic verb case-relation data generated by analyzing the corpus; the verb list, which holds the verbs appearing in the text together with their frequencies; the noun list, which holds the nouns appearing in the text together with their frequencies; the co-occurrence relation data, which holds the co-occurrence relations between the verbs and nouns in those lists; the inter-noun distances obtained by the inter-noun distance calculation unit; the noun clusters generated by clustering; and the noun and verb thesaurus generated by the thesaurus generation process. With this configuration, the ambiguity of verbs is judged from the inter-word distances between verbs and nouns in the documents, the word lists and co-occurrence relation data are corrected according to that judgment, and the nouns are clustered again on that basis, which is said to allow a highly accurate thesaurus to be constructed.

Another example of a technique related to polysemic word extraction is described in Patent Document 2 as a "machine translation device". The machine translation device disclosed in Patent Document 2 comprises an input unit, an input character string storage unit, a translation dictionary unit, a dictionary search unit, a translation processing unit, a knowledge base unit, a word thesaurus unit, an ambiguity resolution unit, and a translation result output unit. A machine translation device with this configuration operates as follows.
The input unit receives a character string in the source language. The input character string storage unit stores the input string. The translation dictionary unit holds morphological information on source-language words and the target language, bilingual correspondence information between the source and target languages, and the like. The dictionary search unit searches the translation dictionary. The translation processing unit translates the source language into another language by referring to the translation dictionary unit and, when it detects ambiguity during translation, instructs the ambiguity resolution unit to resolve it. The knowledge base unit collects co-occurrence relations between words in the source language together with the corresponding target-language expressions. The word thesaurus unit stores semantically similar words. The ambiguity resolution unit resolves the ambiguity that arises when the input string is translated into the target language. To do so, it looks up a translation by referring to the knowledge base; if none is found, it searches the knowledge base again with the original sentence rewritten using semantically similar words from the word thesaurus unit; and if a translation still cannot be found, it selects the translation by frequency. The translation result output unit outputs the result of the translation process. With this configuration, when ambiguity arises in a translated word, the size of the knowledge base is supplemented by the word thesaurus, so that the ambiguity is resolved on the basis of an effectively larger knowledge base.

Patent Document 1: JP 2001-331515 A
Patent Document 2: JP 05-158970 A

The problem with the techniques above is that, when their polysemic word extraction methods are applied to extracting polysemic words that are assigned a plurality of meanings within documents relating to a given project, such as proposals or specifications for information system construction, the extraction rate of polysemic words becomes low.

The reason is that many of the documents in which such polysemic words are used are generally limited in volume, so the same word is unlikely to appear repeatedly as a co-occurrence word of any given word. This makes it difficult to cluster co-occurrence words in the way done in the method of Patent Document 1, which presupposes a large corpus. In other words, with the method of Patent Document 1, even if co-occurrence words are clustered on the basis of a small corpus, polysemic words cannot be extracted with the desired accuracy.

A further problem is that, when the polysemic word extraction methods of the exemplified techniques are applied to extracting polysemic words that are assigned a plurality of meanings within documents relating to a given project, such as proposals or specifications for information system construction, polysemic words that hold only within the document group of a specific project cannot be extracted.

The reason is that the synonymy relations of such polysemic words are difficult to grasp in advance, so it is difficult to determine, with a translation dictionary such as that used in the method of Patent Document 2, whether there are places that must be translated differently because of polysemy. Measures such as preparing a thesaurus for these special polysemic words, separately from existing dictionaries, therefore become necessary. Preparing such a thesaurus, however, imposes a heavy burden.

The problem to be addressed is therefore to extract, as required, the distinctive polysemic words used within a specific scope from the documents that contain them.

In view of the above, an object of the present invention is to provide a polysemic word extraction system, method, and program that extract polysemic words assigned a plurality of meanings within documents relating to a given project, such as proposals or specifications for information system construction.

A polysemic word extraction system according to the present invention comprises: a word analyzing section that extracts each word used in given input text; a key word co-occurrence vector extraction section that selects an arbitrary word from among those words as a key word and extracts a key word co-occurrence vector represented by the key word co-occurrence words regarded as being in a co-occurrence relation with the key word and their co-occurrence counts; a co-occurrence word concept estimation section that estimates, from general concepts, the co-occurrence word concept of each key word co-occurrence word in the key word co-occurrence vector; a co-occurrence word classification section that clusters the key word co-occurrence words of the selected key word on the basis of the similarity between the corresponding co-occurrence word concepts in the estimated group of co-occurrence word concepts; a polysemic word candidate estimation section that extracts the key word as a polysemic word candidate when a plurality of clusters exist for the selected key word; and a polysemic word candidate output section that outputs the extracted polysemic word candidate.

According to the present invention, it is possible to provide a polysemic word extraction system, method, and program that extract polysemic words assigned a plurality of meanings within documents relating to a given project, such as proposals or specifications for information system construction.

A block diagram showing the configuration of the polysemic word extraction system according to the first embodiment.
A flowchart showing an example of the operation of the polysemic word extraction system shown in FIG. 1.
A block diagram showing the configuration of the polysemic word extraction system according to the second embodiment.
A flowchart showing an example of the operation of the polysemic word extraction system shown in FIG. 3.
A block diagram showing the configuration of the polysemic word extraction system according to the first example.
An explanatory diagram showing an example of a table summarizing the key word co-occurrence vectors Ni.
An explanatory diagram showing an example of the classification system of the thesaurus general concept information Cg for the key word co-occurrence words Vij.
An example in which the co-occurrence word concept diagram Cvwj for the key word "material" (資材) is represented as a tree diagram.
An explanatory diagram showing an example of the peripheral word composition table VV.
An explanatory diagram showing an example of the classification system of the thesaurus general concept information Cg for the peripheral words Vvwjf.
An explanatory diagram showing the major-classification co-occurrence word concept table VC1 based on the peripheral words Vvwjf of the co-occurrence words of the key word "material" (資材).
An explanatory diagram showing the middle-classification co-occurrence word concept table VC2 based on the peripheral words Vvwjf of the co-occurrence words of the key word "material" (資材).
An explanatory diagram showing the minor-classification co-occurrence word concept table VC3 based on the peripheral words Vvwjf of the co-occurrence words of the key word "material" (資材).
An explanatory diagram showing an example of a clustering result based on the tree diagram of the co-occurrence word concept diagram Cvwj for the key word "material" (資材).
An explanatory diagram showing an example of a clustering result based on the dendrogram of the co-occurrence word concept diagram Cvwj for the key word "material" (資材).
A block diagram showing the configuration of the polysemic word extraction system according to the second example.
An explanatory diagram showing an example of the partial-match compound word co-occurrence table VUx containing the constituent word "process" (処理).
An explanatory diagram showing an example of the partial-match compound word co-occurrence table VUx containing the constituent word "change" (変更).
An explanatory diagram showing an example of the compound word composition allocation table Te for the compound word "change process" (変更処理).
An explanatory diagram showing an example of a table summarizing the key word co-occurrence vectors Ni when compound words are taken into account.
An explanatory diagram showing an example of the classification system of the thesaurus general concept information Cg for the key word co-occurrence words Vij when compound words are taken into account.
An example in which the co-occurrence word concept diagram Cvwj for the key word "material" (資材), with compound words taken into account, is represented as a tree diagram.
An explanatory diagram showing an example of a clustering result based on the tree diagram of the co-occurrence word concept diagram Cvwj for the key word "material" (資材), with compound words taken into account.

[Embodiment 1]
First, a first embodiment of the present invention is described in detail with reference to the drawings.
FIG. 1 is a block diagram showing the configuration of a polysemic word extraction system 100 according to the first embodiment of the present invention.

Referring to FIG. 1, the polysemic word extraction system 100 according to the first embodiment of the present invention includes at least a document input section 10, a word analyzing section 20, a key word co-occurrence vector extraction section 30, a co-occurrence word concept estimation section 40, a co-occurrence word classification section 50, a polysemic word candidate estimation section 60, a polysemic word candidate output section 70, and a concept database 110. These are provided basically within an electronic device, or within a system made up of a server, an electronic device, and an information communication network such as the Internet that interconnects them.
The illustrated polysemic word extraction system 100 extracts polysemic words that are assigned a plurality of meanings within documents relating to a given project, such as proposals or specifications for information system construction.

When the polysemic word extraction system is built on an electronic device, the polysemic word extraction system 100 can be realized by a computer that operates under program control. Although not illustrated, a computer of this kind has, as is well known, an input device for inputting data, a data processing device, an output device that outputs the results of processing by the data processing device, and an auxiliary storage device that serves as various databases. The data processing device consists of a read-only memory (ROM) that stores a program, a random access memory (RAM) used as a work area for temporarily storing data, and a central processing unit (CPU) that processes the data stored in the RAM in accordance with the program stored in the ROM.
In this case, the data processing device functions as the document input section 10, the word analyzing section 20, the key word co-occurrence vector extraction section 30, the co-occurrence word concept estimation section 40, the co-occurrence word classification section 50, and the polysemic word candidate estimation section 60; the auxiliary storage device operates as the concept database 110; and the output device functions as the polysemic word candidate output section 70.

Next, the operation of each component of the polysemic word extraction system 100 is described.

The document input section 10 accepts input of the document or document group from which polysemic words are to be extracted.

The word analyzing section 20 applies morphological analysis, syntactic analysis, and the like to each sentence making up the document or document group, and extracts as words the independent words used in each sentence that carry meaning on their own, such as nouns, verbs, adjectives, and adjectival nouns. As needed, it also extracts word information such as the part of speech of each word, the type of particle used immediately after it, and the dependency relations between words. Morphemes may be used as they are instead of independent words.

The key word co-occurrence vector extraction section 30 sequentially selects, as a key word, each word used in the sentences as extracted by the word analyzing section 20. Using the word information for each key word and an arbitrary key word co-occurrence determination rule, it extracts for each key word a key word co-occurrence vector represented by the key word co-occurrence words regarded as co-occurring with the key word and their co-occurrence counts. Possible key word co-occurrence determination rules include a rule that treats words in a dependency relation with the key word as co-occurrence words, and a rule that treats words used with a specific particle in the same sentence as the key word as co-occurrence words. The co-occurrence count may be the raw number of co-occurrences, or a frequency such as the number of co-occurrences divided by the total number of co-occurrence words for the key word. The key word co-occurrence words and their counts may also be weighted according to the document from which they are extracted, based on its importance, its reliability, parent-child relations between documents, and the like.
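The extraction of co-occurrence vectors for each key word (基軸単語) described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `cooccurrence_vectors` is hypothetical, the input is assumed to be already tokenized into content words, the co-occurrence rule is assumed to be "appears in the same sentence" (the text also allows dependency-based or particle-based rules), and the optional frequency is assumed to divide each count by the key word's total co-occurrence count.

```python
from collections import Counter, defaultdict


def cooccurrence_vectors(sentences, normalize=False):
    """Build one co-occurrence vector per key word.

    sentences: list of token lists (content words already extracted).
    Assumed rule: two distinct words co-occur when they share a sentence.
    """
    vectors = defaultdict(Counter)
    for tokens in sentences:
        unique = set(tokens)  # count each word pair once per sentence
        for key in unique:
            for other in unique:
                if other != key:
                    vectors[key][other] += 1
    if not normalize:
        return {key: dict(counts) for key, counts in vectors.items()}
    # Assumed frequency variant: divide by the key word's total count.
    result = {}
    for key, counts in vectors.items():
        total = sum(counts.values())
        result[key] = {word: n / total for word, n in counts.items()}
    return result
```

Each word in turn plays the role of the key word, so the result maps every word to its own vector of co-occurrence words and counts.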

The concept database 110 stores general concept information on collected words, such as their concept classification and their general synonyms, near-synonyms, and usage. In response to a query about a specific word, it retrieves and returns the general concept information related to the meaning and usage of that word. A thesaurus that classifies and systematizes words by hypernym/hyponym relations, part/whole relations, synonymy, near-synonymy, and the like corresponds to the concept database 110. A database on the Internet may be used as the concept database 110.

The co-occurrence word concept estimation section 40 uses the general concept information in the concept database 110 to estimate, by a predetermined concept estimation method, the co-occurrence word concept of each key word co-occurrence word in the key word co-occurrence vector.

A suitable concept estimation method is to query the concept database 110 directly for the general concept information on each key word co-occurrence word, and to take as the co-occurrence word concept the key word co-occurrence concept vector obtained by replacing all key word co-occurrence words of a specific key word with general concepts based on that information. If the replacement maps different key word co-occurrence words to the same general concept, those words are merged and the sum of their co-occurrence counts is registered in the corresponding entry. When the concept database 110 is a thesaurus in which concepts at multiple levels, such as major, middle, and minor classifications, are registered as general concept information, a key word co-occurrence concept vector is created for each level; if different co-occurrence words map to the same concept in the vector for a broad level such as the major classification, they are merged and the sum of their counts is registered in the corresponding entry.
Alternatively, when the concept database 110 is a synonym dictionary in which groups of near-synonyms, including synonyms, are registered as general concept information, each key word co-occurrence word may be converted into every member of its near-synonym group, with each of those near-synonyms assigned the co-occurrence count of the original co-occurrence word, and the totals per near-synonym over all co-occurrence words of the same key word may be computed as the key word co-occurrence concept vector. If the concept database 110 has no concept corresponding to a key word co-occurrence word, the co-occurrence word is not converted and the word itself is kept and treated as a concept.
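The replacement of co-occurrence words by general concepts, with counts summed when several words map to the same concept, can be sketched as below. This is an illustrative sketch under assumptions: the names `to_concept_vector` and `concept_vectors_by_level` are hypothetical, the thesaurus is assumed to be a plain dictionary, and a hierarchical thesaurus is assumed to give each word a path of the form (major, middle, minor). Words missing from the thesaurus are kept as their own concept, as the text describes.

```python
from collections import Counter


def to_concept_vector(cooc_vector, concept_of):
    """Replace each co-occurrence word by its concept and sum merged counts.

    cooc_vector: dict mapping co-occurrence word -> count.
    concept_of:  dict mapping word -> concept (assumed thesaurus lookup).
    """
    concepts = Counter()
    for word, count in cooc_vector.items():
        # Words with no registered concept stay as-is.
        concepts[concept_of.get(word, word)] += count
    return dict(concepts)


def concept_vectors_by_level(cooc_vector, paths, levels=3):
    """Create one concept vector per thesaurus level.

    paths: dict mapping word -> (major, middle, minor) classification path.
    Returns a list ordered from the broadest level to the finest.
    """
    vectors = []
    for depth in range(1, levels + 1):
        level_map = {word: path[:depth] for word, path in paths.items()}
        vectors.append(to_concept_vector(cooc_vector, level_map))
    return vectors
```

At the broadest level more words collapse into the same concept, so the vector is shorter and denser; at the finest level it stays close to the original word vector.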

また前記概念推定方法の他の例としては、基軸単語共起語について任意の周辺語判定ルールで基軸単語共起語の周辺に存在する周辺語とその存在数に基づく周辺語構成ベクトルを全基軸単語共起語についてまとめた周辺語構成表を作成し、周辺語構成表の周辺語構成ベクトルにおける各周辺語のそれぞれについて、概念データベース１１０に一般概念情報を問い合わせ、任意の範囲内で周辺語構成表における各周辺語構成ベクトルの各周辺語を一般概念に変換した周辺語概念ベクトルを対応する基軸単語共起語毎に作成し、特定の基軸単語の全基軸単語共起語に対応する周辺概念ベクトルをまとめた基軸単語共起概念表を共起語概念とする方法でも良い。
As another example of the concept estimation method, a peripheral word composition vector is formed for each key word co-occurrence word from the peripheral words found near that co-occurrence word under an arbitrary peripheral word determination rule, together with their occurrence counts, and these vectors are collected into a peripheral word composition table. The concept database 110 is then queried for the general concept information of each peripheral word in the table, and each peripheral word is converted, within an arbitrary range, into a general concept, which yields a peripheral concept vector for each key word co-occurrence word. The key word co-occurrence concept table that collects the peripheral concept vectors for all key word co-occurrence words of a specific key word may be used as the co-occurrence word concept.
Here, the peripheral word determination rule may set the range regarded as the periphery to suit the characteristics of the document, for example one sentence, all sentences in one paragraph, all sentences under the same table-of-contents item, or the entire document. The range may also differ by part of speech, for example verbs that coexist within one sentence and nouns within the sentences under the same table-of-contents item. Whether a word is in a dependency relation with the co-occurrence word may also be used as the peripheral word determination rule. The occurrence count may be a raw count, or a frequency obtained by dividing the raw count by the total number of peripheral words for each key word co-occurrence word. The peripheral word composition table is a matrix in which each row corresponds to a key word co-occurrence word and each column to a peripheral word, and in which the occurrence count of the peripheral word for that key word co-occurrence word is registered as each value. When different peripheral words map to the same concept in the conversion, those peripheral words are merged and the sum of their occurrence counts is registered in the corresponding cell. When a thesaurus in which concepts at several levels, such as major, middle, and minor classifications, are registered as general concept information is used as the concept database 110, a key word co-occurrence concept table is created for each level. When different peripheral words map to the same concept in the table for a broad level such as the major classification, they are merged and the sum of their occurrence counts is registered in the corresponding cell.
Alternatively, when a synonym dictionary in which groups of near-synonyms, including exact synonyms, are registered as general concept information is used as the concept database 110, each peripheral word may be converted into every synonym of its synonym group, with the occurrence count of the peripheral word assigned to each synonym. The total co-occurrence count per synonym, taken over the peripheral words of the same key word co-occurrence word, is then calculated as the peripheral concept vector, and the key word co-occurrence concept table is created by collecting the peripheral concept vectors for all key word co-occurrence words of a specific key word. If the concept database 110 has no concept corresponding to a peripheral word, that word is not converted into a concept; the word itself is retained as a provisional concept.
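As a non-limiting illustration, the conversion of a peripheral word composition table into a key word co-occurrence concept table may be sketched as follows. The dictionary `CONCEPT_DB` is a hypothetical stand-in for the concept database 110, and the function names and sample words are assumptions made only for this sketch.

```python
from collections import Counter

# Hypothetical stand-in for the concept database 110: word -> general concept.
CONCEPT_DB = {"server": "hardware", "router": "hardware", "invoice": "accounting"}

def to_concept_vector(peripheral_counts, concept_db=CONCEPT_DB):
    """Convert one peripheral word composition vector {word: count}
    into a peripheral concept vector {concept: count}."""
    concept_counts = Counter()
    for word, count in peripheral_counts.items():
        # A word with no registered concept is kept as a provisional concept.
        concept = concept_db.get(word, word)
        # Words that map to the same concept are merged by summing their counts.
        concept_counts[concept] += count
    return dict(concept_counts)

def build_concept_table(composition_table, concept_db=CONCEPT_DB):
    """Apply the conversion row by row (one row per key word co-occurrence word)."""
    return {cooc: to_concept_vector(vec, concept_db)
            for cooc, vec in composition_table.items()}

# Rows: key word co-occurrence words. Columns: peripheral words.
composition_table = {
    "install": {"server": 2, "router": 1, "rack": 1},
    "issue": {"invoice": 3},
}
concept_table = build_concept_table(composition_table)
```

In this sketch "server" and "router" merge into the single concept "hardware", while "rack", which has no entry, remains as a provisional concept.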

For the co-occurrence word concepts of the key word co-occurrence words of a specific key word, the co-occurrence word classification unit 50 calculates the similarity between corresponding co-occurrence word concepts using a predetermined similarity index, and clusters the key word co-occurrence words based on that index. The similarity index may be any criterion for judging the semantic similarity between co-occurrence word concepts. For example, suppose the co-occurrence word concept is a key word co-occurrence concept vector in which all key word co-occurrence words of a specific key word are replaced by general concepts based on the general concept information. When the thesaurus is used as the concept database 110, the depth of classification up to which the key word co-occurrence words are regarded as the same general concept is effective as the similarity index. When the synonym dictionary is used as the concept database 110, a function value that decreases monotonically with a distance, such as the cosine distance or the Euclidean distance, between key word co-occurrence concept vectors built from the total co-occurrence counts per synonym is suitable as the similarity index.
When the co-occurrence word concept is the key word co-occurrence concept table that collects the peripheral concept vectors for all key word co-occurrence words of a specific key word, and the thesaurus is used as the concept database 110, the cosine distance, Euclidean distance, or the like between the peripheral concept vectors of the key word co-occurrence words is calculated for each level. A function value that decreases monotonically with a weighted distance, in which distances at deeper and more detailed levels such as the minor classification are given more weight, is appropriate as the similarity index. Any general clustering technique may be used. Hierarchical clustering using a dendrogram or the like may be applied, or non-hierarchical clustering such as the k-means method or the fuzzy c-means method may be applied to virtual positions of the words derived by treating an index that decreases monotonically with similarity as a distance.
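One possible reading of this step, given only as a sketch, uses the cosine distance between concept vectors, a similarity that decreases monotonically with that distance, and a simple single-linkage hierarchical clustering that stops at a distance cutoff. The function names, the form `1 / (1 + d)`, the cutoff `max_distance`, and the sample vectors are all assumptions; the description leaves the concrete index and clustering method open.

```python
import math

def cosine_distance(u, v):
    """Cosine distance between two sparse vectors given as {concept: count}."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 1.0  # assumption: an empty vector is treated as maximally distant
    return 1.0 - dot / (norm_u * norm_v)

def similarity(u, v):
    """A similarity index that decreases monotonically as the distance grows."""
    return 1.0 / (1.0 + cosine_distance(u, v))

def cluster(vectors, max_distance=0.5):
    """Single-linkage agglomerative clustering of co-occurrence words.
    Merging stops when the closest pair of clusters is farther than max_distance."""
    clusters = [[name] for name in sorted(vectors)]

    def linkage(a, b):
        return min(cosine_distance(vectors[x], vectors[y]) for x in a for y in b)

    while len(clusters) > 1:
        dist, i, j = min(
            (linkage(a, b), i, j)
            for i, a in enumerate(clusters)
            for j, b in enumerate(clusters)
            if i < j
        )
        if dist > max_distance:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Concept vectors of four co-occurrence words of one key word (sample data).
vectors = {
    "install": {"hardware": 3, "network": 1},
    "configure": {"hardware": 2, "network": 2},
    "pay": {"accounting": 4},
    "invoice": {"accounting": 2, "finance": 1},
}
clusters = cluster(vectors)
```

With these sample vectors the words fall into a hardware-related group and an accounting-related group, which is the situation the later candidate estimation step looks for.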

For each word taken as a key word, the polysemic word candidate estimation unit 60 examines the clustering result of its key word co-occurrence words. A key word that has a plurality of clusters whose size is equal to or greater than an arbitrarily set threshold is regarded as showing several semantically distinct usages and is extracted as a polysemic word candidate, that is, a word that may be polysemous. The size of a cluster may be measured, for example, by the co-occurrence counts of the key word co-occurrence words belonging to it.
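The threshold test may be sketched as below, assuming that the size of a cluster is the sum of the co-occurrence counts of its members. The function names, the threshold value, and the sample data are hypothetical.

```python
def cluster_size(cluster, cooccurrence_counts):
    """Size of a cluster, taken here as the summed co-occurrence counts of its members."""
    return sum(cooccurrence_counts[word] for word in cluster)

def is_polysemic_candidate(clusters, cooccurrence_counts, min_size=3):
    """A key word is a candidate when two or more clusters reach the size threshold."""
    large = [c for c in clusters if cluster_size(c, cooccurrence_counts) >= min_size]
    return len(large) >= 2

# Key word co-occurrence vector of one key word (sample data).
counts = {"configure": 2, "install": 3, "invoice": 1, "pay": 4}
two_usages = [["configure", "install"], ["invoice", "pay"]]
one_usage = [["configure", "install", "pay"], ["invoice"]]

candidate = is_polysemic_candidate(two_usages, counts)
not_candidate = is_polysemic_candidate(one_usage, counts)
```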

The polysemic word candidate output unit 70 outputs the polysemic word candidates extracted by the polysemic word candidate estimation unit 60. The output may take any required form. A suitable form is to output the entire document with the candidate key words marked by color coding, bold type, or similar emphasis. The output may also be a table listing the extracted combinations of polysemic word candidates. As another form, a graph may be displayed in which the candidate key word is the main node, each cluster based on the concepts of its key word co-occurrence words is an intermediate node, and the key word co-occurrence words belonging to each cluster are leaf nodes, with the relations shown as links and links with high co-occurrence counts emphasized by color. A quantitative degree of polysemy may also be attached to the candidates, using for example the similarity index used when extracting them, and the display may be limited to candidates whose degree of polysemy exceeds an arbitrarily set threshold. Alternatively, the color coding, the bold emphasis, or the character size of the words in the graph may be varied according to the degree of polysemy. The output forms may be made selectable so that the user can move from the base display to a table or a graph as needed. Verbs, nouns, and other parts of speech may also be output selectively as needed.
Next, the overall operation of the polysemic word extraction system 100 according to the first embodiment is described in detail with reference to FIG. 1 and the sequence shown in FIG. 2. The flowchart of FIG. 2 and the following description are an example of processing; the order of processing may be changed, and steps may be revisited or repeated, depending on the desired effect.

The document input unit 10 accepts input of the target document or document group (step A1 in FIG. 2).
The word analysis unit 20 applies morphological analysis, syntactic analysis, and the like to each sentence of the document or document group, and extracts as words the independent words that carry meaning on their own, such as nouns, verbs, adjectives, and adjectival nouns (na-adjectives). It also extracts word information such as the part of speech of each word, the type of particle used immediately after it, and the dependency relations between words (step A2).

The key word co-occurrence vector extraction unit 30 selects arbitrary words used in the sentences, as extracted by the word analysis unit 20, as key words. Based on the word information of each key word, it extracts a key word co-occurrence vector consisting of the key word co-occurrence words, that is, the words regarded under a predetermined key word co-occurrence determination rule as co-occurring with the key word, and their co-occurrence counts (step A3).
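Step A3 may be illustrated with a minimal sketch that assumes the simplest co-occurrence determination rule, namely that two words co-occur when they appear in the same sentence. The input is assumed to be already tokenized by the word analysis of step A2; the function name and the sample sentences are hypothetical.

```python
from collections import Counter, defaultdict

def cooccurrence_vectors(sentences):
    """Build {key_word: {co-occurrence word: count}} from tokenized sentences.
    Assumed rule: words in the same sentence co-occur with each other."""
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for i, key in enumerate(tokens):
            for j, other in enumerate(tokens):
                # Skip the key word itself and repeated instances of it.
                if i != j and other != key:
                    vectors[key][other] += 1
    return {key: dict(counts) for key, counts in vectors.items()}

# Each inner list holds the independent words of one sentence (sample data).
sentences = [
    ["system", "process", "order"],
    ["system", "manage", "server"],
    ["order", "process", "payment"],
]
vectors = cooccurrence_vectors(sentences)
```

Other rules named in the description, such as paragraph-level ranges or dependency relations, would replace the inner loop's notion of "same sentence".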

In response to a query about a specific word, the concept database 110 searches its accumulated general concept information, such as concept classifications of words and common synonyms, near-synonyms, and usages, and returns the general concept information relevant to the meaning and usage of that word (step A4).

Using the general concept information of the concept database 110, the co-occurrence word concept estimation unit 40 estimates a co-occurrence word concept for each key word co-occurrence word of the key word co-occurrence vector, based on a predetermined concept estimation method (step A5).

For the key word co-occurrence words of a specific key word, the co-occurrence word classification unit 50 refers to the estimated co-occurrence word concepts, calculates the similarity between corresponding co-occurrence word concepts using a predetermined similarity index, and clusters the key word co-occurrence words based on that index (step A6).

From the clustering result of the key word co-occurrence words of each key word, the polysemic word candidate estimation unit 60 successively extracts, as polysemic word candidates, the key words that have a plurality of clusters whose size is equal to or greater than an arbitrarily set threshold, since such key words show several semantically distinct usages and may be polysemous (step A7).

The polysemic word candidate output unit 70 outputs the polysemic word candidates extracted by the polysemic word candidate estimation unit 60 (step A8).

Next, the effects of the polysemic word extraction system 100 according to the first embodiment of the present invention are described.
In the first embodiment, key word co-occurrence words in a document or document group are converted into co-occurrence word concepts, and polysemic word candidates are extracted from the result of clustering co-occurrence words that are semantically similar but do not match as words. Consequently, even when the amount of text is small, so that each key word co-occurrence word appears only a few times and the distance between key word co-occurrence words tends to be 0, it is possible to determine whether a key word has several usage patterns, and polysemous words that are assigned a plurality of meanings within the documents of a given project can be extracted accurately.
The polysemic word extraction system 100 according to the first embodiment can also be realized as a polysemic word extraction method. It may also be implemented by having a computer execute a polysemic word extraction program.

[Embodiment 2]
Next, a second embodiment is described in detail with reference to the drawings.
FIG. 3 is a block diagram showing the configuration of the polysemic word extraction system 100A according to the second embodiment.

Referring to FIG. 3, the polysemic word extraction system 100A according to the second embodiment further includes a constituent word dominance calculation unit 35 and a compound word composition distribution estimation unit 36. Except that the word analysis unit and the co-occurrence word concept estimation unit operate differently, as described later, it has the same configuration and operates in the same way as the polysemic word extraction system 100 according to the first embodiment shown in FIG. 1. Accordingly, the word analysis unit is given reference numeral 20A and the co-occurrence word concept estimation unit is given reference numeral 40A.

When the illustrated polysemic word extraction system 100A is implemented on the computer described above, the data processing device functions as the document input unit 10, the word analysis unit 20A, the key word co-occurrence vector extraction unit 30, the constituent word dominance calculation unit 35, the compound word composition distribution estimation unit 36, the co-occurrence word concept estimation unit 40A, the co-occurrence word classification unit 50, and the polysemic word candidate estimation unit 60. The auxiliary storage device operates as the concept database 110, and the output device functions as the polysemic word candidate output unit 70.
The word analysis unit 20A obtains the compound words among the words in the document and the constituent words of those compound words. The constituent word dominance calculation unit 35 calculates the constituent word dominance of each constituent word of a compound word. The compound word composition distribution estimation unit 36 creates a compound word composition distribution table in which the concept of each constituent word of a compound word is weighted according to its constituent word dominance. Before converting key word co-occurrence words into concepts, the co-occurrence word concept estimation unit 40A converts the co-occurrence count of each key word co-occurrence word that is a compound word in the key word co-occurrence vector into counts distributed according to the compound word composition distribution table.

Next, the operation of each component of the polysemic word extraction system 100A is described.

In addition to the operation of the word analysis unit 20 shown in FIG. 1, the word analysis unit 20A queries the concept database 110 for the general concept information of each extracted word, and extracts as a compound word any word that is not registered in the concept database 110 and has two or more characters. Furthermore, for every substring of a compound word, it queries the concept database 110 for general concept information and extracts the substrings that have registered general concept information as significant constituent words of the compound word. If, after the significant constituent words are separated from the original compound word, a substring with no registered general concept information remains, that substring is extracted as an unknown constituent word. In these respects the word analysis unit 20A differs from the word analysis unit 20 shown in FIG. 1.
When several combination patterns of registered substrings are possible for a compound word, the best pattern is determined by an arbitrary constituent word separation rule, and the significant and unknown constituent words of that pattern are extracted. Effective separation rules include giving priority to the pattern with the fewest characters in unknown constituent words, giving priority to significant constituent words that appear frequently as standalone words in the input documents, giving priority to significant constituent words that appear frequently as standalone words in general documents, and combinations of these. In addition, when a character string shared with other compound words in the input documents is used at or above a predetermined frequency, a rule may be used that gives priority, as significant constituent words, to the character strings that remain after the shared string is removed.
The general concept information may be, for example, a thesaurus classification, keywords that directly express the meaning of a word, or a set of synonyms.
In the following, the term "constituent word" used alone covers both significant constituent words and unknown constituent words.
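The separation rule that favors the pattern with the fewest unknown characters may be sketched with a small dynamic program. The set `known` is a hypothetical stand-in for the words registered in the concept database 110, the tie-break on the number of segments is an assumption of this sketch, and an ASCII string is used only to keep the example readable.

```python
from functools import lru_cache

def split_compound(compound, known):
    """Split a compound word into (substring, is_significant) pairs so that
    the number of characters left in unknown constituent words is minimal."""
    n = len(compound)

    @lru_cache(maxsize=None)
    def best(start):
        # Returns (unknown_character_count, segments) for compound[start:].
        if start == n:
            return (0, ())
        # Option 1: treat the next character as part of an unknown constituent.
        cost, rest = best(start + 1)
        candidates = [(cost + 1, ((compound[start], False),) + rest)]
        # Option 2: consume any registered substring starting here.
        for end in range(start + 1, n + 1):
            piece = compound[start:end]
            if piece in known:
                cost, rest = best(end)
                candidates.append((cost, ((piece, True),) + rest))
        # Assumed tie-break: prefer fewer segments when costs are equal.
        return min(candidates, key=lambda c: (c[0], len(c[1])))

    _, segments = best(0)
    # Join adjacent unknown characters into one unknown constituent word.
    merged = []
    for text, significant in segments:
        if merged and not significant and not merged[-1][1]:
            merged[-1] = (merged[-1][0] + text, False)
        else:
            merged.append((text, significant))
    return merged

known = {"data", "base", "database", "log"}
parts = split_compound("xlogdatabase", known)
```

Here "x" is left as an unknown constituent word, and "database" is preferred over "data" plus "base" only because of the assumed tie-break; a frequency-based rule from the description could decide differently.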

Based on the words and compound words used in each sentence, as extracted by the word analysis unit 20A, the constituent word dominance calculation unit 35 treats the words that co-occur with a compound word under an arbitrary compound word co-occurrence determination rule as compound word co-occurrence words. For each compound word it extracts the compound word co-occurrence words and their co-occurrence counts, and collects them into a compound word co-occurrence table.
The compound word co-occurrence determination rule may be chosen to suit the characteristics of the document, for example one sentence, all sentences in one paragraph, all sentences under the same table-of-contents item, the entire document, the document title, or the position of the document within the document group. For example, the range may be varied by part of speech, such as co-occurrence within one sentence for verbs and co-occurrence within all sentences under the same table-of-contents item for nouns.
The co-occurrence count may be a raw count, or a frequency obtained by dividing the raw count by the total number of co-occurrence words for each compound word.
Further, when the word information includes dependency relations between words, whether a word is in a dependency relation may be used as the compound word co-occurrence determination rule.
The compound word co-occurrence table is a matrix in which each row corresponds to a compound word and each column to a compound word co-occurrence word, and in which the co-occurrence count of the compound word co-occurrence word for that compound word is registered as each value.

Further, based on the compound word co-occurrence table and the constituent words extracted by the word analysis unit 20A, the constituent word dominance calculation unit 35 extracts from the compound word co-occurrence table the compound word co-occurrence vectors of the partially matching compound words, that is, the compound words that contain the same constituent word, and creates a partially matching compound word co-occurrence table for each constituent word. It then calculates, as the constituent word dominance, the degree of aggregation among the partially matching compound words in the co-occurrence vector space obtained from the compound word co-occurrence vectors of that table.
In the co-occurrence vector space all vectors may be treated equally, or the space may be transformed by weighting according to the part of speech of the compound word co-occurrence words. The degree of aggregation may be calculated by any method as long as it is an index that expresses how small the scatter is among the vectors corresponding to the partially matching compound words. For example, any function that decreases monotonically with a measure of dispersion commonly used in statistics, such as the variance, the standard deviation, or the coefficient of variation, may be used; the reciprocal of the variance or the reciprocal of the coefficient of variation is suitable.
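As one hedged illustration of the degree of aggregation, the sketch below takes the reciprocal of the total variance of the co-occurrence vectors around their centroid. The function name, the small constant `eps` that avoids division by zero, and the sample vectors are assumptions; the description permits any function that decreases monotonically with a dispersion measure.

```python
def constituent_dominance(vectors, eps=1e-9):
    """Dominance of one constituent word, computed from the co-occurrence
    vectors {word: count} of the compound words that contain it.
    Assumed form: reciprocal of the mean squared distance to the centroid."""
    if not vectors:
        raise ValueError("at least one co-occurrence vector is required")
    keys = sorted(set().union(*vectors))
    n = len(vectors)
    centroid = {k: sum(v.get(k, 0.0) for v in vectors) / n for k in keys}
    variance = sum(
        sum((v.get(k, 0.0) - centroid[k]) ** 2 for k in keys) for v in vectors
    ) / n
    # eps is an assumption that keeps the value finite when all vectors coincide.
    return 1.0 / (variance + eps)

# Compounds sharing one constituent are used in similar contexts (tight),
# compounds sharing another constituent are used in unrelated contexts (loose).
tight = [{"store": 2, "query": 0}, {"store": 2, "query": 1}]
loose = [{"store": 4}, {"print": 4}]

dominance_tight = constituent_dominance(tight)
dominance_loose = constituent_dominance(loose)
```

A constituent word whose compounds cluster tightly receives the larger dominance, which matches the idea that it governs the meaning of the compounds it appears in.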

Using the constituent word dominance values calculated by the constituent word dominance calculation unit 35, the compound word composition distribution estimation unit 36 calculates, for each compound word, the constituent word weighting coefficients among its constituent words, and compiles these coefficients into a compound word composition distribution table.
The compound word composition distribution table is a matrix in which each row corresponds to a compound word and each column to a constituent word of the compound word, and in which the corresponding constituent word weighting coefficient is registered.
An effective way to calculate the constituent word weighting coefficients is to normalize the dominance values, that is, to divide the constituent word dominance of each constituent word by the sum of the constituent word dominance values for that compound word.
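The normalization described above may be sketched as follows; the function name and the dominance values are hypothetical.

```python
def weighting_coefficients(constituents, dominance_by_word):
    """Normalize the dominance of each constituent word of one compound word
    so that the coefficients of that compound word sum to 1."""
    total = sum(dominance_by_word[word] for word in constituents)
    return {word: dominance_by_word[word] / total for word in constituents}

# Hypothetical dominance values for the constituent words of "database".
dominance_by_word = {"data": 3.0, "base": 1.0}

# One row of the compound word composition distribution table.
distribution_table = {
    "database": weighting_coefficients(["data", "base"], dominance_by_word)
}
```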

In addition to the operation of the co-occurrence word concept estimation unit 40 described above, the co-occurrence word concept estimation unit 40A handles the key word co-occurrence words that are compound words, among the co-occurrence words of the key word co-occurrence vectors in the key word co-occurrence table created by the key word co-occurrence vector extraction unit 30. Using coefficients based on the compound word composition distribution table created by the compound word composition distribution estimation unit 36, it estimates, in line with the required estimation method, a co-occurrence word concept suited to each compound word. As one example, the co-occurrence word concept estimation unit 40A makes each constituent word of a compound word an independent key word co-occurrence word, and modifies the key word co-occurrence vector by assigning to each constituent word a co-occurrence count equal to the co-occurrence count of the compound word multiplied by the weighting coefficient of that constituent word in the compound word composition distribution table. It then estimates the co-occurrence word concept of each key word co-occurrence word of the key word co-occurrence vector based on the predetermined concept estimation method.
When the concept estimation method takes peripheral words, including compound words, into account and uses as the co-occurrence word concept the key word co-occurrence concept table that collects the peripheral concept vectors for all key word co-occurrence words of a specific key word, the peripheral word composition vectors may be modified in the same way. That is, for each peripheral word that is a compound word, each constituent word is made an independent peripheral word, and its occurrence count is set to the occurrence count of the peripheral word multiplied by the weighting coefficient of that constituent word in the compound word composition distribution table created by the compound word composition distribution estimation unit 36.
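The redistribution of a compound word's co-occurrence count over its constituent words may be sketched as below. The function name, the table contents, and the sample vector are assumptions; adding a distributed count to an existing entry for the same constituent word is also an assumption, since the description does not state how such overlaps are handled.

```python
def distribute_compounds(cooccurrence_vector, distribution_table):
    """Replace each compound word in {word: count} by its constituent words,
    giving each constituent the compound's count times its weighting coefficient."""
    result = {}
    for word, count in cooccurrence_vector.items():
        weights = distribution_table.get(word)
        if weights is None:
            # Not a compound word: keep the count unchanged.
            result[word] = result.get(word, 0.0) + count
            continue
        for constituent, weight in weights.items():
            # Assumption: counts for the same constituent word are accumulated.
            result[constituent] = result.get(constituent, 0.0) + count * weight
    return result

distribution_table = {"database": {"data": 0.75, "base": 0.25}}
key_word_vector = {"database": 4, "server": 2, "data": 1}

modified_vector = distribute_compounds(key_word_vector, distribution_table)
```

The modified vector contains only words that can be looked up in the concept database, so the concept estimation of step A5 can proceed as in the first embodiment.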

The configurations and functions of the remaining components, namely the document input unit 10, the key word co-occurrence vector extraction unit 30, the co-occurrence word classification unit 50, the polysemic word candidate estimation unit 60, the polysemic word candidate output unit 70, and the concept database 110, are the same as those of the first embodiment, and their description is omitted.

Next, the overall operation of the polysemic word extraction system 100A according to the second embodiment is described with reference to FIG. 3 and the sequence shown in FIG. 4. The flowchart of FIG. 4 and the following description are an example of processing; as in the first embodiment, the order of processing may be changed and steps may be revisited.
Compared with the operation of the first embodiment described above, the operation of the second embodiment differs in that the following operations are added.

In addition to the operation of the word analysis unit 20 shown in FIG. 1 (step A2), the word analysis unit 20A queries the concept database 110 for the general concept information of each extracted word, and extracts as a compound word any word that is not registered in the concept database 110 and has two or more characters (step B1).
Furthermore, for every substring of a compound word, the word analysis unit 20A queries the concept database 110 for general concept information and extracts the substrings that have registered general concept information as significant constituent words of the compound word. If, after the significant constituent words are separated from the original compound word, a substring with no registered general concept information remains, that substring is extracted as an unknown constituent word (step B2).

Next, based on the word information of the words used in each sentence and the compound words, as extracted by the word analysis unit 20A, the constituent word dominance calculation unit 35 treats the words that co-occur with a compound word under the compound word co-occurrence determination rule as compound word co-occurrence words. For each compound word it extracts the compound word co-occurrence words and their co-occurrence counts, and collects them into a compound word co-occurrence table (step B3).
Further, based on the compound word co-occurrence table and the constituent words extracted by the word analysis unit 20A, the constituent word dominance calculation unit 35 extracts from the compound word co-occurrence table the compound word co-occurrence vectors of the partially matching compound words that contain the same constituent word, and creates a partially matching compound word co-occurrence table for each constituent word. It then calculates, as the constituent word dominance, the degree of aggregation among the partially matching compound words in the co-occurrence vector space obtained from the compound word co-occurrence vectors of that table (step B4).

Next, using the constituent word dominance values calculated by the constituent word dominance calculation unit 35, the compound word composition distribution estimation unit 36 calculates, for each compound word, the constituent word weighting coefficients among its constituent words, and compiles these coefficients into a compound word composition distribution table (step B5).

In addition to the operation of the co-occurrence word concept estimation unit 40 shown in FIG. 1 (step A5), the co-occurrence word concept estimation unit 40A processes those basic word co-occurrence words that are compound words among the co-occurrence words of the basic word co-occurrence vectors in the basic word co-occurrence table created by the basic word co-occurrence vector extraction unit 30. It makes each constituent word an independent basic word co-occurrence word and, based on the compound word composition distribution table created by the compound word composition distribution estimation unit 36, modifies the basic word co-occurrence vector by using, as the co-occurrence count of each constituent word, the co-occurrence count of the compound basic word co-occurrence word multiplied by that constituent word's weighting coefficient. It then estimates the co-occurrence word concept of each basic word co-occurrence word in the basic word co-occurrence vector based on the predetermined concept estimation method (step A5').
The other steps operate in the same way as in the first embodiment described above, so their description is omitted.
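The vector modification of step A5' can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation described in the specification: the function name `split_compounds`, the numeric weighting coefficients, and the choice to add a constituent's share to any existing entry for the same word are all hypothetical.

```python
from collections import defaultdict


def split_compounds(cooc_vector, weight_table):
    """Replace compound co-occurrence words by their weighted constituents.

    cooc_vector:  {co-occurrence word: co-occurrence count}
    weight_table: {compound word: {constituent word: weighting coefficient}}
    """
    result = defaultdict(float)
    for word, count in cooc_vector.items():
        weights = weight_table.get(word)
        if weights is None:
            # Not a compound word: keep the count unchanged.
            result[word] += count
        else:
            # Compound word: each constituent receives count x coefficient.
            for constituent, coeff in weights.items():
                result[constituent] += count * coeff
    return dict(result)


# Hypothetical coefficients; the document gives no numeric values.
weights = {"purchase processing": {"purchase": 0.5, "processing": 0.5}}
vector = {"manufacturing": 5, "purchase processing": 3}
modified = split_compounds(vector, weights)
```

With these assumed coefficients, the compound entry disappears from the vector and its count of 3 is shared equally between "purchase" and "processing".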

Next, the effects of the above operation of the second embodiment are described.
In the second embodiment, in addition to the effects of the first embodiment, the constituent word dominance of each constituent word is calculated for compound words among the basic word co-occurrence words, and each compound word is converted into concepts weighted according to that dominance. Polysemy candidates can therefore be extracted while taking into account compound words that have no general concept information registered in a thesaurus or the like. Such unregistered, project-specific compound words are an obstacle to converting a basic word co-occurrence vector into a basic word concept vector. With this configuration, the similarity between basic word co-occurrence words can be evaluated even for a group of texts containing many such compound words, and polysemous words that are assigned multiple meanings within the documents of a given project can be extracted more accurately.
The polysemy extraction system 100A according to the second embodiment can also be realized as a polysemy extraction method. The polysemy extraction system 100A according to the second embodiment of the present invention may also be executed by a computer under a polysemy extraction program.

Next, the operation of the polysemy extraction system 100 according to the first embodiment is described with reference to FIG. 5, using a specific first example.

The first example has the following purpose.
First, the polysemy extraction system 100 estimates polysemy candidates A that hold only within the group of documents for a specific project contained in documents D. Documents D, such as proposals and specifications for information system construction, contain polysemous words that are also used to denote concepts different from their general meaning. By outputting the estimation result, the polysemy extraction system 100 supports the creation of a glossary of unregistered terms and the definition of words. In the first example, the polysemy extraction system 100 consists of a document analysis system Y and an Internet server Z, as shown in FIG. 5.
The document analysis system Y runs on a PC terminal owned by an analyst B. Through an input unit and an output unit, it accepts the text of the document group from which analyst B wants to extract polysemous words and presents the polysemy candidates A.
The Internet server Z is connected via a communication network to the PC terminal of analyst B on which the document analysis system Y is implemented. The Internet server Z is a device that, in response to queries from the document analysis system Y for concept information such as word meanings, makes it possible to search general concept information Cg relating to the concept classification of words, common polysemous words and near-synonyms, and usage.

The correspondence between FIG. 5 and FIG. 1 is as follows.
The document input unit 10, the word analysis unit 20, the basic word co-occurrence vector extraction unit 30, the co-occurrence word concept estimation unit 40, the co-occurrence word classification unit 50, and the polysemy candidate estimation unit 60 are included in the document analysis system Y. The polysemy candidate output unit 70 operates as the output unit of the PC terminal. The concept database 110 is included in the Internet server Z. The document analysis system Y and the Internet server Z, provided with these means, operate as follows.

The document analysis system Y accepts, from the input unit, documents D that make up the document group for which analyst B wants to estimate polysemy candidates A that hold only within the document group of a specific project. The document analysis system Y then applies morphological analysis and syntactic analysis to each sentence of documents D to break the text into words, analyzes each word's part of speech and dependency relations, and extracts nouns, verbs, adjectives, and adjectival verbs as words W. For verbs of the sa-row irregular conjugation, the conjugated part is removed and the resulting verbal noun (so-called sahen noun) is extracted as the verb.

Further, the document analysis system Y treats the nouns among the words W contained in documents D as basic words S. For each basic word Si (i = 1, 2, ..., n), taking a specific basic word Sw (i = w), it extracts the verbs, adjectives, and adjectival verbs in a dependency relation with Sw as basic word co-occurrence words Vwj (j = 1, 2, ..., m), tallies the number of times each Vwj co-occurs with Sw as the co-occurrence count Nwj, and creates the basic word co-occurrence vector Nw. For example, suppose that words such as "material", "general affairs", ... are extracted from documents D as basic words Sw, that words such as "manufacturing", "stockpiling", "mining", "disposal", "mixing", "purchase processing", "estimate", "order", "budget", "change processing", ... are extracted as co-occurrence words V, and that the co-occurrence counts Nwj of the basic word co-occurrence words Vwj for each basic word Sw are as shown in FIG. 6. Then the data set in each row of the table in FIG. 6 corresponds to a basic word co-occurrence vector Ni, the data set for a specific basic word Sw corresponds to the basic word co-occurrence vector Nw, and the basic word co-occurrence vector Nw of "material" is expressed as {5, 1, 1, 1, 3, 3, 4, 2, 1, 4, ...}.
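As a rough illustration of how such a co-occurrence vector could be tallied, the sketch below counts (basic word, co-occurring predicate) pairs. It assumes that a morphological and dependency parser has already produced these pairs; the pair list and the function name `build_cooccurrence_vectors` are hypothetical and do not come from the specification.

```python
from collections import Counter, defaultdict


def build_cooccurrence_vectors(dependency_pairs):
    """Count, for each basic word (noun), how often each predicate co-occurs."""
    vectors = defaultdict(Counter)
    for noun, predicate in dependency_pairs:
        vectors[noun][predicate] += 1
    return vectors


# Hypothetical parser output: (basic word, dependent verb/adjective) pairs.
pairs = [
    ("material", "manufacturing"),
    ("material", "order"),
    ("material", "manufacturing"),
    ("general affairs", "budget"),
]
vectors = build_cooccurrence_vectors(pairs)
material_vector = dict(vectors["material"])
```

Each inner counter plays the role of one row of the co-occurrence table: its keys are the co-occurrence words Vwj and its values the counts Nwj.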

The Internet server Z stores general concept information Cg of a thesaurus in which words are classified and organized according to their general hypernym/hyponym, part/whole, synonym, and near-synonym relations. The Internet server Z also provides functions such as a search engine for retrieving information on any word. In response to a query from the document analysis system Y, it extracts the large, middle, and small classifications that make up the general concept classification of the queried word and presents them as general concept information Cg.

The document analysis system Y extracts the co-occurrence word concept Cvwj for each basic word co-occurrence word Vwj of the basic word co-occurrence vector Nw, based on the general concept information Cg obtained by querying the Internet server Z.

A suitable method for extracting the co-occurrence word concepts Cvwj is to query the Internet server Z directly for the general concept information Cg of each basic word co-occurrence word Vwj. From the classification system of the thesaurus general concept information Cg stored in the Internet server Z, the large-classification co-occurrence word concept C1vwj, the middle-classification co-occurrence word concept C2vwj, and the small-classification co-occurrence word concept C3vwj to which each Vwj belongs are extracted as its co-occurrence word concept Cvwj, and a co-occurrence word concept diagram Cvwj is created, organized as a tree structure or the like so that the concept co-occurrence count Ncwj at each classification level can be seen. This method is called the direct concept extraction method. With the direct concept extraction method, if the co-occurrence word concepts C1vwj, C2vwj, and C3vwj shown in FIG. 7 are extracted for the basic word co-occurrence words Vwj (with co-occurrence counts Nwj) of the basic word co-occurrence vector Nw for the basic word Sw "material" in FIG. 6, the co-occurrence word concept diagram Cvwj is represented by a tree diagram such as FIG. 8. In FIG. 8, the concept co-occurrence count Ncwj at each classification level of the co-occurrence word concept diagram Cvw is calculated as the sum of the co-occurrence counts Nwj of the basic word co-occurrence words Vwj belonging beneath it. Compound words such as "change processing" and "purchase processing", for which no general concept information Cg is registered in the Internet server Z, are processed by keeping the basic word co-occurrence word itself as a provisional concept.
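The level-by-level summation in the direct concept extraction method can be sketched as below. The counts are taken from the "material" vector, assuming its values align with the listed word order. The classification paths in `taxonomy` are hypothetical stand-ins for the thesaurus contents of FIG. 7, which are not reproduced in the text; only the small classification "production" shared by "manufacturing" and "mining" appears in the document.

```python
from collections import Counter

# Subset of the co-occurrence counts for the basic word "material".
counts = {"manufacturing": 5, "stockpiling": 1, "mining": 1,
          "order": 2, "purchase processing": 3}

# Hypothetical thesaurus paths: (large, middle, small) classification.
taxonomy = {
    "manufacturing": ("human activity", "industry", "production"),
    "mining": ("human activity", "industry", "production"),
    "stockpiling": ("human activity", "industry", "storage"),
    "order": ("human activity", "economy", "trade"),
}


def concept_tree_counts(counts, taxonomy):
    """Sum co-occurrence counts at every level of the classification tree."""
    tree = Counter()
    for word, n in counts.items():
        # Unregistered words (e.g. compounds) stay as provisional concepts.
        path = taxonomy.get(word, (word,))
        for depth in range(1, len(path) + 1):
            tree[path[:depth]] += n
    return tree


tree = concept_tree_counts(counts, taxonomy)
```

Each key of `tree` is a path prefix, so a node's value is the total count of all co-occurrence words beneath that node, which is how the concept co-occurrence count at each level is defined.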

A more advanced method for extracting the co-occurrence word concepts Cvwj, the indirect concept extraction method, is described below. In the indirect concept extraction method, for each basic word co-occurrence word Vwj, the verbs, adjectives, and adjectival verbs in a dependency relation with Vwj, together with the nouns that co-occur with it in the text under the same table-of-contents item, are extracted as peripheral words Vvwjf (f = 1, 2, ..., y). The number of times each peripheral word Vvwjf co-occurs with the basic word co-occurrence word Vwj is tallied as the existence count Ljf, and a peripheral word composition table VV is created that summarizes, in tabular form, the peripheral words Vvwjf for all basic word co-occurrence words Vwj.
The data set of existence counts Ljf of the peripheral words Vvwjf for a basic word co-occurrence word Vwj in the peripheral word composition table VV is called the peripheral word composition vector Lj. The general concept information Cg of each peripheral word Vvwjf in the table VV is obtained from the classification system of the thesaurus stored in the Internet server Z by querying the Internet server Z. The large-classification peripheral word concept C1vwjf, the middle-classification peripheral word concept C2vwjf, and the small-classification peripheral word concept C3vwjf to which each peripheral word Vvwjf belongs are then extracted, and three tables are created. The large-classification co-occurrence word concept table VC1 is obtained by converting each peripheral word Vvwjf in the table VV into its peripheral word concept C1vwjf, merging peripheral words that share the same concept, and registering the sum of their existence counts Ljf in the corresponding cell. The middle-classification co-occurrence word concept table VC2 and the small-classification co-occurrence word concept table VC3 are obtained in the same way, using the peripheral word concepts C2vwjf and C3vwjf respectively.
The data set of existence counts Lc1jf of the peripheral word concepts C1vwjf for a basic word co-occurrence word Vwj in the table VC1 is called the large-classification co-occurrence word concept vector Lc1j. The data set of existence counts Lc2jf of the peripheral word concepts C2vwjf for a basic word co-occurrence word Vwj in the table VC2 is called the middle-classification basic word concept vector Lc2j. The data set of existence counts Lc3jf of the peripheral word concepts C3vwjf for a basic word co-occurrence word Vwj in the table VC3 is called the small-classification co-occurrence word concept vector Lc3j.

Here, the large-classification co-occurrence word concept vector Lc1j, the middle-classification basic word concept vector Lc2j, and the small-classification co-occurrence word concept vector Lc3j correspond to the co-occurrence word concept Cvwj. For example, suppose that, as in FIG. 6, words such as "manufacturing", "change processing", ... are extracted from documents D as basic word co-occurrence words Vwj, and that words such as "use", "operation", "construction", "improvement", "system change", "mechanism", "instantaneous", "short-term", "running", "high-speed processing", ... are extracted as their peripheral words Vvwjf. Then the peripheral word composition table VV is a table such as FIG. 9, with a basic word co-occurrence word Vwj in each row and a peripheral word Vvwjf in each column, listing the existence counts Ljf. Each row for a basic word co-occurrence word Vwj in FIG. 9 corresponds to a peripheral word composition vector Lj, and the peripheral word composition vector Lj of "manufacturing" is expressed as {0, 3, 2, 0, 4, 0, 1, 0, 3, 0, ...}. Because both the basic word co-occurrence words Vwj and the peripheral words Vvwjf include nouns, a word that was selected as a basic word co-occurrence word Vwj may itself be treated as a peripheral word Vvwjf when another word is the basic word co-occurrence word Vwj.

Further, if the peripheral word concepts C1vwjf, C2vwjf, and C3vwjf shown in FIG. 10 are extracted for the peripheral words Vvwjf in the peripheral word composition table VV of FIG. 9, the large-classification co-occurrence word concept table VC1 is as shown in FIG. 11, the middle-classification table VC2 as in FIG. 12, and the small-classification table VC3 as in FIG. 13. Each is a table with a basic word co-occurrence word Vwj in each row and the peripheral word concepts Cvwjf of the respective classification in each column. The counts in the tables VC1, VC2, and VC3 are obtained as follows, taking VC1 as an example. Among the peripheral words Vvwjf, "use", "operation", "construction", "improvement", and "running" share the peripheral word concept C1vwjf "human activity", so the existence count Lc1jf is "8", the sum of the existence counts of these peripheral words for the same basic word co-occurrence word "manufacturing". Likewise, "mechanism", "instantaneous", and "short-term" share the peripheral word concept C1vwjf "abstract", so the existence count Lc1jf is "1", the sum of their existence counts for "manufacturing". Compound words such as "system change" and "high-speed processing", for which no general concept information Cg is registered in the Internet server Z, are processed by keeping the co-occurrence word itself as a provisional concept. From FIG. 11, the large-classification co-occurrence word concept vector Lc1j of the basic word co-occurrence word "manufacturing" is expressed as {8, 4, 1, 0, ...}.
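The conversion from a peripheral word composition vector to a large-classification concept vector can be checked with a short sketch. The word list, the "manufacturing" row, and the two large classifications are taken from the text (here "running" is used for the second word that the machine translation also renders as "operation"); the function name `to_concept_vector` is an assumption made for illustration.

```python
from collections import Counter

peripheral_words = ["use", "operation", "construction", "improvement",
                    "system change", "mechanism", "instantaneous",
                    "short-term", "running", "high-speed processing"]
# Peripheral word composition vector Lj of "manufacturing".
manufacturing_row = [0, 3, 2, 0, 4, 0, 1, 0, 3, 0]

# Large classification of each registered peripheral word.
large_class = {
    "use": "human activity", "operation": "human activity",
    "construction": "human activity", "improvement": "human activity",
    "running": "human activity",
    "mechanism": "abstract", "instantaneous": "abstract",
    "short-term": "abstract",
}


def to_concept_vector(words, row, classification):
    """Merge peripheral words that share a concept and sum their counts."""
    vec = Counter()
    for word, n in zip(words, row):
        # Unregistered compounds remain as their own provisional concept.
        vec[classification.get(word, word)] += n
    return vec


lc1 = to_concept_vector(peripheral_words, manufacturing_row, large_class)
```

The resulting counts are 8 for "human activity", 4 for "system change", 1 for "abstract", and 0 for "high-speed processing", which matches the vector {8, 4, 1, 0, ...} given for "manufacturing".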

Further, the document analysis system Y calculates the similarity Fw between basic word co-occurrence words Vwj based on their co-occurrence word concepts Cvwj, groups together the basic word co-occurrence words Vwj whose similarity Fw exceeds an arbitrary threshold, and thereby clusters the basic word co-occurrence words Vwj to extract basic word co-occurrence word clusters Ewz.

As an example of how the similarity Fwpq between the basic word co-occurrence words Vwp (j = p) and Vwq (j = q) is calculated, when the co-occurrence word concepts Cvwj were obtained by the direct concept extraction method, the similarity is quantified as the difference in levels between the classification level at which the co-occurrence word concepts Cvwp and Cvwq fall into the same class and the coarsest classification level of the classification system. For example, in a thesaurus with a three-level classification system of large (level 1), middle (level 2), and small (level 3) classifications as in FIG. 8, the basic word co-occurrence words Vwp "manufacturing" and Vwq "mining" match at the small-classification co-occurrence word concept C3vwj "production", so the difference between level 1 and level 3, "2", is the similarity index. If the threshold for the similarity Fw is set to 1 or more in the example of FIG. 8, the basic word co-occurrence words Vwj are clustered at or below the middle-classification co-occurrence word concept C2vwj, and the five clusters enclosed by dotted lines in FIG. 14 are extracted as basic word co-occurrence word clusters Ewz.
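The level-difference similarity for the direct concept extraction method can be sketched as follows. The classification paths are hypothetical apart from the shared small classification "production", and returning 0 when two words do not even share a large classification is an assumption, since the text does not define that case.

```python
def hierarchy_similarity(path_p, path_q):
    """Number of levels below the coarsest one at which two paths still agree."""
    shared = 0
    for a, b in zip(path_p, path_q):
        if a != b:
            break
        shared += 1
    # Agreement only at level 1 (or not at all) gives similarity 0.
    return max(shared - 1, 0)


# Hypothetical (large, middle, small) classification paths.
manufacturing = ("human activity", "industry", "production")
mining = ("human activity", "industry", "production")
stockpiling = ("human activity", "industry", "storage")
order = ("human activity", "economy", "trade")

sim_same_small = hierarchy_similarity(manufacturing, mining)
sim_same_middle = hierarchy_similarity(manufacturing, stockpiling)
sim_same_large = hierarchy_similarity(manufacturing, order)
```

Under these assumed paths, a threshold of 1 or more groups words that agree at least at the middle classification, so "manufacturing", "mining", and "stockpiling" would fall into one cluster and "order" into another.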

On the other hand, when the co-occurrence word concepts Cvwj were obtained by the indirect concept extraction method, three cosine distances are calculated: the cosine distance dc1pq between the large-classification co-occurrence word concept vector Lc1p corresponding to the basic word co-occurrence word Vwp and the large-classification co-occurrence word concept vector Lc1q corresponding to the basic word co-occurrence word Vwq, the cosine distance dc2pq between the middle-classification co-occurrence word concept vectors Lc2p and Lc2q, and the cosine distance dc3pq between the small-classification co-occurrence word concept vectors Lc3p and Lc3q. Using equation (1) below, the sum of these distances multiplied by the respective classification weighting coefficients β1, β2, and β3 (β1 < β2 < β3) is calculated as the basic word co-occurrence word distance dpq, and the similarity Fwpq is calculated by a function that decreases monotonically with dpq, such as its reciprocal. This processing is performed for all combinations of basic word co-occurrence words Vij.
dpq = β1 × dc1pq + β2 × dc2pq + β3 × dc3pq ... (1)
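Equation (1) and the reciprocal similarity can be written out as a short sketch. The definition of cosine distance as one minus cosine similarity, the handling of zero vectors, and all function names are assumptions made here; the default weighting coefficients are those of the document's worked example.

```python
import math


def cosine_distance(u, v):
    """Cosine distance, assumed here to be 1 - cosine similarity."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0:
        return 1.0  # assumption: a zero vector is maximally distant
    return 1.0 - dot / norm


def word_distance(dc1, dc2, dc3, betas=(0.009, 0.09, 0.9)):
    """Equation (1): weighted sum of the three per-level cosine distances."""
    b1, b2, b3 = betas
    return b1 * dc1 + b2 * dc2 + b3 * dc3


def similarity(d):
    """One monotonically decreasing choice: the reciprocal (requires d > 0)."""
    return 1.0 / d


# Per-level distances from the document's example for two co-occurrence words.
d_pq = word_distance(0.26, 0.57, 0.68)
```

Because β1 < β2 < β3, agreement at the small classification dominates the result: 0.009 × 0.26 + 0.09 × 0.57 + 0.9 × 0.68 = 0.66564, which rounds to the 0.67 stated in the example.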

For example, in the example of FIGS. 11 to 13, the cosine distances between "manufacturing" and "accumulation" are dc1pq = 0.26, dc2pq = 0.57, and dc3pq = 0.68. With classification weighting coefficients β1 = 0.009, β2 = 0.09, and β3 = 0.9, the basic word co-occurrence word distance is dpq = 0.67. As the clustering method, each basic word co-occurrence word Vwj is regarded as an initial cluster. Based on the distances dpq, the two clusters with the smallest inter-cluster distance are merged into a new cluster, the distances among all the resulting clusters are recomputed, and the two closest clusters are merged again. Repeating this until all clusters have been merged into a single cluster yields a dendrogram. The groups of basic word co-occurrence words Vwj that are joined under an arbitrary inter-cluster distance criterion are taken as the basic word co-occurrence word clusters Ewz. When the inter-cluster distance criterion is set to 5 in the dendrogram obtained from the information in FIGS. 9 to 13, two clusters are extracted as basic word co-occurrence word clusters Ewz, as shown in FIG. 15.
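The bottom-up merging described above can be sketched in plain Python. The text does not say how the distance between multi-word clusters is measured, so single linkage (the smallest pairwise distance) is assumed here. The pairwise distances are hypothetical. Stopping as soon as the closest pair exceeds the cut is used as a shortcut for building the full dendrogram and cutting it at that height, which gives the same groups under single linkage.

```python
def agglomerative_clusters(items, dist, cut):
    """Repeatedly merge the two closest clusters until none is within `cut`."""
    clusters = [[x] for x in items]

    def linkage(c1, c2):
        # Single linkage (assumption): smallest pairwise word distance.
        return min(dist[frozenset((a, b))] for a in c1 for b in c2)

    while len(clusters) > 1:
        d, i, j = min(
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if d > cut:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical distances d_pq between four co-occurrence words.
words = ["manufacturing", "mining", "estimate", "order"]
dist = {
    frozenset(("manufacturing", "mining")): 0.2,
    frozenset(("estimate", "order")): 0.3,
    frozenset(("manufacturing", "estimate")): 0.9,
    frozenset(("manufacturing", "order")): 0.9,
    frozenset(("mining", "estimate")): 0.9,
    frozenset(("mining", "order")): 0.9,
}
clusters = agglomerative_clusters(words, dist, cut=0.5)
```

With these made-up distances and a cut of 0.5, the four words fall into two clusters, one production-related and one purchasing-related.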

For the basic word co-occurrence word clusters Ewz obtained by clustering the basic word co-occurrence words Vwj of a specific basic word Sw, the document analysis system Y extracts, as the cluster scale Nwz, the sum of the co-occurrence counts Nwj of the basic word co-occurrence words Vwj belonging to each cluster Ewz. A basic word Sw that has multiple clusters whose cluster scale Nwz is at or above an arbitrarily set threshold is then extracted as a polysemy candidate Aw, that is, a word that shows several semantically distinct usages and may therefore be polysemous.
In the example of FIGS. 6 to 9, where the co-occurrence word concepts Cvwj were obtained by the direct concept extraction method, with a threshold of 20% there are 25 basic word co-occurrence words Vwj, so the two clusters "industry" and "economy" are extracted as basic word co-occurrence word clusters Ewz containing five or more basic word co-occurrence words Vwj, and the basic word Sw "material" is judged to be a polysemy candidate Aw. From the meanings of the basic word co-occurrence words Vwj belonging to the cluster "industry" and of those belonging to the cluster "economy", it is highly likely that "material" carries two meanings, namely "materials" and an abbreviation of "materials procurement department". Polysemy of this kind can thus be found. Similarly, in the example of FIG. 15, where the co-occurrence word concepts Cvwj were obtained by the indirect concept extraction method, with a threshold of 20% both clusters contain five or more basic word co-occurrence words Vwj, so the basic word Sw "material" is judged to be a polysemy candidate Aw.
Further, for each polysemy candidate Aw, the document analysis system Y marks up the occurrences of that candidate in the requirements document D, for example by color coding or bold emphasis, and outputs the processed requirements document D from the output unit.
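The candidate decision can be sketched as below. The counts are the ten listed values of the "material" vector (total 25), assumed to align with the listed word order. The cluster memberships are hypothetical because FIG. 14 is not reproduced in the text, and reading the 20% threshold as a fraction of the total co-occurrence count is an interpretation, not something the text states outright.

```python
counts = {"manufacturing": 5, "stockpiling": 1, "mining": 1, "disposal": 1,
          "mixing": 3, "purchase processing": 3, "estimate": 4, "order": 2,
          "budget": 1, "change processing": 4}

# Hypothetical cluster memberships for the basic word "material".
clusters = {
    "industry": ["manufacturing", "stockpiling", "mining", "disposal", "mixing"],
    "economy": ["estimate", "order", "budget"],
    "purchase processing": ["purchase processing"],
    "change processing": ["change processing"],
}


def find_polysemy(clusters, counts, ratio=0.2):
    """Return (is_candidate, large_clusters) for one basic word."""
    threshold = ratio * sum(counts.values())
    large = [name for name, words in clusters.items()
             # Cluster scale: sum of the co-occurrence counts of its members.
             if sum(counts[w] for w in words) >= threshold]
    return len(large) >= 2, large


is_candidate, large_clusters = find_polysemy(clusters, counts)
```

Under these assumptions the threshold is 5, the "industry" and "economy" clusters have scales 11 and 7, and "material" is flagged because two clusters reach the threshold.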

Next, the operation of the polysemy extraction system 100A according to the second embodiment is described with reference to FIG. 10, using a specific second example.
In the second example, the polysemy extraction system 100A uses an Internet server Z' as shown in FIG. 16.
The document analysis system Ya runs on the PC terminal owned by analyst B. Through the input unit and the output unit, it accepts the text of the document group from which analyst B wants to extract polysemous words and presents the polysemy candidates A.

The Internet server Z' is a server that provides an existing thesaurus, and is connected via a communication network to the PC terminal of analyst B on which the document analysis system Ya is implemented. The Internet server Z' is a device that, in response to queries from the document analysis system Ya for the concept information of words, makes it possible to search general concept information Cg relating to the concept classification of words, common synonyms and near-synonyms, and usage.

In the second example, in addition to the operation of the first example, the document analysis system Ya further includes the constituent word dominance calculation unit 35 and the compound word composition distribution estimation unit 36.
That is, the correspondence between FIG. 16 and FIG. 3 is as follows.
The document input unit 10, the word analysis unit 20A, the constituent word dominance calculation unit 35, the compound word composition distribution estimation unit 36, the basic word co-occurrence vector extraction unit 30, the co-occurrence word concept estimation unit 40A, the co-occurrence word classification unit 50, and the polysemy candidate estimation unit 60 are included in the document analysis system Ya. The polysemy candidate output unit 70 operates as the output unit of the PC terminal. The concept database 110 is included in the Internet server Z'.

With this configuration, the document analysis system Ya adds the following operations to those of the first embodiment.
The document analysis system Ya queries the Internet server Z' for the general concept information Cg of each base word co-occurrence word Vij, and thereby checks whether general concept information Cg for each word Vij is registered in the thesaurus stored on the Internet server Z'. Words that have no general concept information Cg registered in the thesaurus and that are two or more characters long are extracted as compound words Vme (e = 1, 2, ..., h). For example, if the word "購買処理" (purchase processing) is not registered in the thesaurus, it is extracted as a compound word because it has two or more characters.
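The extraction rule above can be illustrated with a minimal sketch. It assumes the thesaurus on the Internet server Z' can be stood in for by a local set of registered words; the function name `extract_compound_words` and the sample words are hypothetical and do not come from the specification.

```python
def extract_compound_words(cooccurrence_words, thesaurus):
    """Return words that are unregistered in the thesaurus and two or more characters long."""
    return [w for w in cooccurrence_words if w not in thesaurus and len(w) >= 2]


# Hypothetical stand-in for the thesaurus lookup on Internet server Z'.
thesaurus = {"購買", "処理", "変更", "発注"}

# Hypothetical base word co-occurrence words Vij.
words = ["購買処理", "変更処理", "発注", "Ｘ"]

compounds = extract_compound_words(words, thesaurus)
print(compounds)
```

Here "発注" is skipped because it is registered, and the single character "Ｘ" is skipped because it is shorter than two characters.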

Furthermore, for each compound word Vme, the document analysis system Ya splits the character string of the compound word in every possible pattern and checks, for every resulting substring, whether general concept information Cg is registered in the thesaurus stored on the Internet server Z'. The substrings of the pattern in which the unregistered substrings contain the fewest characters are treated as the constituent words Pek (k = 1, 2, ..., l) of the compound word Vme. Among the constituent words Pek, substrings with registered general concept information Cg are extracted as significant constituent words Paek, and substrings without registration are extracted as unknown constituent words Pbek, for each compound word.
In the example of the compound word "購買処理" (purchase processing) in FIG. 6, the possible splits are {"購", "買処理"}, {"購買", "処理"}, and {"購買処", "理"}. If "買処理" and "購買処" are not registered in the thesaurus, then "購", "購買", "処理", and "理" are candidates for significant constituent words Paek, and "買処理" and "購買処" are candidates for unknown constituent words Pbek. The split {"購買", "処理"}, in which the unregistered substrings contain the fewest characters, is selected as the significant constituent words of the compound word "購買処理".
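The split selection can be sketched as follows. This is an illustration under stated assumptions, not the implementation of the word analysis unit 20A: only two-part splits are enumerated, as in the FIG. 6 example, ties are resolved in favor of the first split found, and the thesaurus is again a local set. The name `split_compound` is hypothetical.

```python
def split_compound(compound, thesaurus):
    """Pick the two-part split whose unregistered parts contain the fewest characters."""
    if len(compound) < 2:
        raise ValueError("a compound word needs at least two characters")
    best_parts, best_cost = None, None
    for i in range(1, len(compound)):
        parts = (compound[:i], compound[i:])
        # Cost = total characters in substrings with no thesaurus entry.
        cost = sum(len(p) for p in parts if p not in thesaurus)
        if best_cost is None or cost < best_cost:
            best_parts, best_cost = parts, cost
    significant = [p for p in best_parts if p in thesaurus]    # Paek
    unknown = [p for p in best_parts if p not in thesaurus]    # Pbek
    return best_parts, significant, unknown


# Registration status taken from the "購買処理" example in the text.
thesaurus = {"購", "購買", "処理", "理"}
parts, significant, unknown = split_compound("購買処理", thesaurus)
print(parts, significant, unknown)
```

With the registrations given in the example, the splits {"購", "買処理"} and {"購買処", "理"} each leave three unregistered characters, while {"購買", "処理"} leaves none, so the latter is chosen.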

The document analysis system Ya works within the sentences of a paragraph that analyst B has designated as a group of sentences addressing a particular range of content in document D, such as "functions of the information system to be built". Within that range, the nouns that co-occur with a compound word Vme, together with the verbs, adjectives, and adjectival verbs in a dependency relation with the compound word Vme, are taken as s compound word co-occurrence words Umer (r = 1, 2, ..., s). For each compound word Vme, the system extracts the compound word co-occurrence words Umer and their co-occurrence counts Mer within the range regarded as co-occurrence, and creates a compound word co-occurrence table VUm. This table is a sparse matrix whose rows correspond to the compound words Vme, whose columns correspond to the compound word co-occurrence words Umer, and whose entries are the co-occurrence counts Mer of Umer with respect to Vme.
Next, for each constituent word Pek of the compound word co-occurrence table VUm, the document analysis system Ya extracts the row components (Mx1, Mx2, Mx3, ..., Mxs) of the t compound words Vmx that contain the same constituent word Px (x = 1, 2, ..., t), and creates a partially matching compound word co-occurrence table VUx. This table is a sparse matrix whose rows correspond to the compound words Vmx, whose columns correspond to the compound word co-occurrence words Umxr, and whose entries are the co-occurrence counts Mxr of Umxr with respect to Vmx.
For example, the table in FIG. 17 is created as the partially matching compound word co-occurrence table for the constituent word "処理" (processing), and the table in FIG. 18 for the constituent word "変更" (change). The document analysis system Ya then follows Equation 1: for each compound word co-occurrence word Umxr of the table VUx, it calculates the variance σxr of the data column (M1r, M2r, M3r, ..., Mtr), and it takes the reciprocal of the square root of the mean of the variances σxr over all compound word co-occurrence words Umxr as the constituent word dominance Gx of the constituent word Px, that is, $G_x = 1 / \sqrt{\tfrac{1}{s}\sum_{r=1}^{s}\sigma_{xr}}$.
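The dominance calculation described above can be sketched as follows. The sketch assumes the population variance (the text does not say whether population or sample variance is meant) and a dense list-of-rows table in place of the sparse matrix; the function name and the two small tables are hypothetical and are not the contents of FIG. 17 or FIG. 18.

```python
from math import sqrt
from statistics import mean, pvariance


def constituent_dominance(table):
    """Reciprocal of the square root of the mean per-column variance.

    table: one row per compound word containing the constituent word,
           one column per compound word co-occurrence word (counts Mxr).
    """
    columns = list(zip(*table))
    variances = [pvariance(col) for col in columns]  # assumed: population variance
    # Note: a table whose rows are all identical has zero mean variance,
    # which this sketch does not handle (division by zero).
    return 1 / sqrt(mean(variances))


# Hypothetical tables: rows that differ strongly vs. rows that are similar.
g_scattered = constituent_dominance([[4, 0], [0, 4]])
g_tight = constituent_dominance([[3, 1], [1, 3]])
print(g_scattered, g_tight)
```

A constituent word whose compound words have similar co-occurrence rows yields a smaller mean variance and therefore a larger dominance, which is the behavior the text relies on when it later ranks "変更" above "処理".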

The document analysis system Ya calculates a normalized constituent word weighting coefficient αek by dividing the constituent word dominance Gxek corresponding to each constituent word Pek of a compound word Vme by the sum of the constituent word dominances Gxek of that compound word. It then creates a compound word composition distribution table Te, a sparse matrix whose rows correspond to the compound words Vme, whose columns correspond to the constituent words Pek, and whose entries are the constituent word weighting coefficients αek of each constituent word Pek with respect to the compound word Vme.
For example, consider "変更処理" (change processing) and "購買処理" (purchase processing), which are the compound words among the base word co-occurrence words of FIG. 6. If the constituent word dominance Gx is 1.47 for the constituent word "処理" (processing), 2.21 for "変更" (change), and 3.43 for "購買" (purchase), the compound word composition distribution table Te is as shown in FIG. 19. FIG. 19 indicates that, when the compound word "変更処理" is understood as a combination of the constituent words "変更" and "処理", the constituent word "変更" is more important than the constituent word "処理".
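The normalization can be checked against the figures quoted in the text. The sketch below uses the dominance values 1.47, 2.21, and 3.43 from the example; the function name `weighting_coefficients` and the dictionary layout of the table Te are assumptions made for illustration.

```python
def weighting_coefficients(constituents, dominance):
    """Divide each constituent's dominance by the sum over the compound word."""
    total = sum(dominance[c] for c in constituents)
    return {c: dominance[c] / total for c in constituents}


# Dominance values Gx quoted in the example.
dominance = {"処理": 1.47, "変更": 2.21, "購買": 3.43}

# Compound word composition distribution table Te as a nested dict (assumed layout).
te = {
    "変更処理": weighting_coefficients(["変更", "処理"], dominance),
    "購買処理": weighting_coefficients(["購買", "処理"], dominance),
}

for compound, weights in te.items():
    print(compound, {c: round(a, 1) for c, a in weights.items()})
```

Rounded to one decimal place, the coefficients come out as 0.6 and 0.4 for "変更処理" and 0.7 and 0.3 for "購買処理", matching the values the description attributes to FIG. 19.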

Because a compound word Vme is one of the base word co-occurrence words Vij, the document analysis system Ya makes each constituent word Pek of a compound word Vmwe that co-occurs with a particular base word Sw an independent base word co-occurrence word Vmwek. Based on the compound word composition distribution table Te, it calculates the co-occurrence count Nwek as the co-occurrence count Nwe of the compound word Vmwe multiplied by the constituent word weighting coefficient αek of each constituent word Pek, and thereby updates the base word co-occurrence vector Nw. Taking the base word "資材" (materials) in FIG. 6 as an example, the constituent words "変更" (change) and "処理" (processing) of the compound word "変更処理", and "購買" (purchase) and "処理" of the compound word "購買処理", become independent base word co-occurrence words. As shown in FIG. 19, the constituent word weighting coefficients of "変更処理" are 0.6 for "変更" and 0.4 for "処理", and those of "購買処理" are 0.7 for "購買" and 0.3 for "処理". The weighted co-occurrence counts Nwek are therefore 4 × 0.6 = 2.4 for "変更", 4 × 0.4 + 3 × 0.3 = 2.5 for "処理", and 3 × 0.7 = 2.1 for "購買". The other base word co-occurrence words Vw are processed in the same way, and the base word co-occurrence vector shown in FIG. 6 is converted into the base word co-occurrence vector shown in FIG. 20.
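The update of the base word co-occurrence vector can be reproduced with a short sketch. The co-occurrence counts 4 and 3 and the coefficients are those of the example; the function name `redistribute` and the dictionary representation of the vector Nw and the table Te are assumptions.

```python
from collections import defaultdict


def redistribute(cooccurrence, distribution_table):
    """Replace each compound word by its constituents, weighting its count by alpha."""
    updated = defaultdict(float)
    for word, count in cooccurrence.items():
        weights = distribution_table.get(word)
        if weights is None:
            updated[word] += count          # not a compound word: keep as is
        else:
            for constituent, alpha in weights.items():
                updated[constituent] += count * alpha
    return dict(updated)


# Compound-word entries of the vector Nw for the base word "資材" in the example.
nw = {"変更処理": 4, "購買処理": 3}
te = {
    "変更処理": {"変更": 0.6, "処理": 0.4},
    "購買処理": {"購買": 0.7, "処理": 0.3},
}

nw_updated = redistribute(nw, te)
print({w: round(n, 1) for w, n in nw_updated.items()})
```

The constituent "処理" appears in both compound words, so its contributions are summed, giving 2.5 as in the text.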

The other operations of the document analysis system Ya are the same as in the first embodiment. For example, suppose that, for each base word co-occurrence word Vwj with co-occurrence count Nwj in the base word co-occurrence vector Nw of the base word Sw "資材" (materials) in FIG. 20, the direct concept extraction method yields the co-occurrence word concepts C1vwj, C2vwj, and C3vwj shown in FIG. 21. The co-occurrence word concept diagram Cvwj is then the tree diagram shown in FIG. 22. If the threshold for the similarity Fw is set to 1 or more in the example of FIG. 22, the base word co-occurrence words Vwj are clustered at or below the middle classification C2vwj, and the three clusters enclosed by dotted lines in FIG. 23 are extracted as base word co-occurrence word clusters Ewz. Even when the cluster size threshold is raised to 30%, higher than in the first embodiment, two clusters, "産業" (industry) and "経済" (economy), are extracted as base word co-occurrence word clusters Ewz to which base word co-occurrence words Vwj belong, and the base word Sw "資材" is judged to be a polysemous word candidate Aw. By considering unknown compound words that are not registered in the dictionary constituent word by constituent word in this way, polysemous words can be estimated accurately while taking more base word co-occurrence words into account.
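The clustering and threshold decision can be sketched roughly as follows. Several points are assumptions made only for illustration: each co-occurrence word is given a concept path (C1, C2, C3), words are grouped when they share the path down to the middle classification C2, and cluster size is measured as a share of the total weighted co-occurrence count. The concept paths and counts below are hypothetical and are not the contents of FIG. 20 to FIG. 23.

```python
from collections import defaultdict


def cluster_by_concept(counts, concepts, depth=2):
    """Group co-occurrence words that share the first `depth` concept levels."""
    clusters = defaultdict(float)
    for word, count in counts.items():
        clusters[concepts[word][:depth]] += count
    return dict(clusters)


def is_polysemous_candidate(clusters, min_share=0.3):
    """True when two or more clusters each reach `min_share` of the total count."""
    total = sum(clusters.values())
    large = [key for key, size in clusters.items() if size / total >= min_share]
    return len(large) >= 2, large


# Hypothetical weighted co-occurrence counts and concept paths (C1, C2, C3).
counts = {"購買": 2.1, "発注": 3.0, "加工": 2.0, "製造": 2.5, "天気": 0.4}
concepts = {
    "購買": ("社会", "経済", "売買"),
    "発注": ("社会", "経済", "取引"),
    "加工": ("社会", "産業", "製造業"),
    "製造": ("社会", "産業", "製造業"),
    "天気": ("自然", "気象", "天候"),
}

clusters = cluster_by_concept(counts, concepts)
candidate, large_clusters = is_polysemous_candidate(clusters, min_share=0.3)
print(candidate, large_clusters)
```

In this made-up data three clusters form, but only the "経済" and "産業" clusters reach the 30% share, so the base word would be flagged as a candidate.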

As described above, the polysemous word extraction system of the present invention handles documents concerning a given project, such as proposals and specifications for information system construction, in which words are assigned multiple meanings. For such a document, the polysemous words in effect can be identified from the analyzed document or document group itself. This in turn reduces confusion and failures caused by misunderstandings when an information system is built. The reason is that the similarity of a word's co-occurrence words is calculated from their degree of agreement at the concept level, and the co-occurrence words are then clustered. As a result, even from the limited volume of text in a document group for a specific project, and even when the same co-occurrence word is never reused, it is possible to extract words that have multiple groups of co-occurrence words in terms of usage and are therefore likely to be polysemous.

The specific configuration of the present invention is not limited to the embodiments described above; modifications that do not depart from the gist of the invention are included in the invention.

For example, to extract polysemous words as needed from a document containing words used with a concept different from the general concept, an information processing apparatus operating as a polysemous word extraction system proceeds as follows when extracting polysemous words from a document received from the input unit. It extracts each word used in the sentences. It takes an arbitrary word from the extracted words as a base word and extracts the base word co-occurrence vector of that base word from the base word co-occurrence words in a co-occurrence relation with it and their co-occurrence counts. It individually estimates the co-occurrence word concept of each base word co-occurrence word in the base word co-occurrence vector and clusters the base word co-occurrence words based on the similarity between the estimated co-occurrence word concepts. When a plurality of clusters exist for the selected base word, that base word is taken as a polysemous word candidate. This process is repeated, and the extracted polysemous word candidates are output from the output unit.
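The co-occurrence vector step in this summary can be illustrated with a minimal sketch. It assumes that sentences are already tokenized and that any word in the same sentence as the base word counts as a co-occurrence word; the description allows other determination rules, such as dependency relations. The function name and the sample sentences are hypothetical.

```python
from collections import Counter


def cooccurrence_vector(sentences, base_word):
    """Count the words that appear in the same sentence as the base word."""
    vector = Counter()
    for tokens in sentences:
        if base_word in tokens:
            vector.update(t for t in tokens if t != base_word)
    return vector


# Hypothetical tokenized sentences from a document group.
sentences = [
    ["資材", "購買処理", "変更"],
    ["資材", "購買処理"],
    ["工場", "製造"],
]

vector = cooccurrence_vector(sentences, "資材")
print(vector)
```

The third sentence does not contain the base word, so its words do not enter the vector.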

In this case, a weight may be assigned to each document (each sentence group) under analysis.
For example, documents of high reliability and documents of lower reliability may be accepted as input together with their weights, and the weights may be used as coefficients.
Weights may also be assigned according to the author or the affiliated organization of the document group (sentence group) under analysis.
Weights may also be calculated from the citation relationships and citation counts of the document group.
When polysemous words are extracted from translated text, the general concepts of the source language before translation may be used as the general concepts.
This information may be received from the operator, or it may be extracted automatically by applying natural language analysis, such as syntactic or semantic analysis, to the text.
Natural language analysis may also be used to extract suitable candidates for, or to select automatically, the algorithms to be used, such as the concept estimation method.
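One way the document weights could enter the calculation is sketched below. The description only says that the weights are used as coefficients, so applying them as multipliers on per-document co-occurrence counts is an assumption, as are the function name, the document identifiers, and the sample values.

```python
from collections import defaultdict


def weighted_cooccurrence(doc_vectors, doc_weights):
    """Sum per-document co-occurrence counts, each scaled by its document weight."""
    total = defaultdict(float)
    for doc_id, vector in doc_vectors.items():
        weight = doc_weights.get(doc_id, 1.0)   # unweighted documents count fully
        for word, count in vector.items():
            total[word] += weight * count
    return dict(total)


# Hypothetical per-document co-occurrence counts for one base word.
doc_vectors = {
    "spec": {"購買": 2, "製造": 1},
    "memo": {"購買": 1, "製造": 4},
}
# Hypothetical weights: the specification is trusted more than the memo.
doc_weights = {"spec": 1.0, "memo": 0.5}

combined = weighted_cooccurrence(doc_vectors, doc_weights)
print(combined)
```

The combined vector could then be passed to the concept estimation and clustering steps in place of the unweighted counts.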

The present invention supports the understanding, creation, and revision of the various documents exchanged in tasks such as requirements definition in software and system development by removing ambiguity from those documents. It is applicable to improving the efficiency of system development, for example by reducing rework and improving customer satisfaction.
In addition, because polysemous words can be extracted accurately, the invention can be used in translation systems to choose between alternative translations.

Description of symbols
10 Document input unit
20, 20A Word analysis unit
30 Base word co-occurrence vector extraction unit
35 Constituent word dominance calculation unit
36 Compound word composition distribution estimation unit
40, 40A Co-occurrence word concept estimation unit
50 Co-occurrence word classification unit
60 Polysemous word candidate estimation unit
70 Polysemous word candidate output unit
100, 100A Polysemous word extraction system
D Document
Y, Ya Document analysis system
Z, Z' Internet server

Claims

A polysemous word extraction system comprising:
a word analysis unit that extracts each word used in given input text;
a base word co-occurrence vector extraction unit that selects an arbitrary word from the words as a base word and extracts a base word co-occurrence vector represented by the base word co-occurrence words regarded as co-occurring with the base word and their co-occurrence counts;
a co-occurrence word concept estimation unit that estimates, from general concepts, a co-occurrence word concept for each base word co-occurrence word of the base word co-occurrence vector;
a co-occurrence word classification unit that clusters the base word co-occurrence words of the selected base word based on the similarity between the corresponding co-occurrence word concepts in the estimated group of co-occurrence word concepts;
a polysemous word candidate estimation unit that extracts the base word as a polysemous word candidate when a plurality of clusters exist for the selected base word; and
a polysemous word candidate output unit that outputs the extracted polysemous word candidates.

A polysemous word extraction system comprising:
a document input unit that receives input of a target document or document group;
a word analysis unit that extracts each word used in the sentences constituting the document or document group and extracts, for each word, word information on its part of speech, case, attached particles, and dependency relations between words;
a base word co-occurrence vector extraction unit that selects an arbitrary word from the words as a base word and extracts, for each base word, a base word co-occurrence vector represented by the base word co-occurrence words regarded as co-occurring with the base word under an arbitrary base word co-occurrence determination rule based on the word information, and their co-occurrence counts;
a concept database that collects and stores general concept information organizing the general concepts of words, such as concept classifications, synonyms, near-synonyms, and usage, and that, in response to a query about a specific word, retrieves and returns general concept information on the meaning and usage of that word;
a co-occurrence word concept estimation unit that estimates the co-occurrence word concept of each base word co-occurrence word of the base word co-occurrence vector based on a predetermined concept estimation method using the general concept information of the concept database;
a co-occurrence word classification unit that, for the base word co-occurrence words of a specific base word, calculates the similarity between the corresponding co-occurrence word concepts with a predetermined similarity index and clusters the base word co-occurrence words based on the similarity index between the co-occurrence word concepts;
a polysemous word candidate estimation unit that extracts, as a polysemous word candidate, a base word for which there exist a plurality of clusters of base word co-occurrence words whose size is at or above an arbitrarily set threshold; and
a polysemous word candidate output unit that outputs the extracted polysemous word candidates.

A polysemous word extraction system comprising:
a word analysis unit that extracts each word used in given input text and extracts the compound words among those words and their constituent words;
a constituent word dominance calculation unit that calculates a constituent word dominance for each constituent word;
a compound word composition distribution estimation unit that calculates a constituent word weighting coefficient for each compound word using the constituent word dominances;
a base word co-occurrence vector extraction unit that selects an arbitrary word from the words as a base word and extracts a base word co-occurrence vector represented by the base word co-occurrence words regarded as co-occurring with the base word and their co-occurrence counts;
a co-occurrence word concept estimation unit that, for each base word co-occurrence word that is a compound word in each base word co-occurrence vector of the base word co-occurrence table, treats each constituent word as a base word co-occurrence word, updates the base word co-occurrence vector using, as the co-occurrence count of each constituent word, the value obtained by multiplying the co-occurrence count of that base word co-occurrence word by the constituent word weighting coefficient of the constituent word, and estimates, from general concepts, a co-occurrence word concept for each base word co-occurrence word of the base word co-occurrence vector;
a co-occurrence word classification unit that clusters the base word co-occurrence words of the selected base word based on the similarity between the corresponding co-occurrence word concepts in the estimated group of co-occurrence word concepts;
a polysemous word candidate estimation unit that extracts the base word as a polysemous word candidate when a plurality of clusters exist for the selected base word; and
a polysemous word candidate output unit that outputs the extracted polysemous word candidates.

A polysemous word extraction system comprising:
a document input unit that receives input of a target document or document group;
a concept database that collects and stores general concept information organizing the general concepts of words, such as concept classifications, synonyms, near-synonyms, and usage, and that, in response to a query about a specific word, retrieves and returns general concept information on the meaning and usage of that word;
a word analysis unit that extracts each word used in the sentences constituting the document or document group, extracts, for each word, word information on its part of speech, case, attached particles, and dependency relations between words, extracts as a compound word any word that has no general concept information registered and is two or more characters long, and, among the arbitrary substrings constituting the compound word, extracts substrings with registered general concept information as significant constituent words and substrings without registration as unknown constituent words;
a constituent word dominance calculation unit that, based on the word information of each word and the compound words, extracts for each compound word the compound word co-occurrence words regarded as co-occurring under a compound word co-occurrence determination rule and their co-occurrence counts, compiles them into a compound word co-occurrence table, extracts from the compound word co-occurrence table the compound word co-occurrence vectors of the partially matching compound words that contain the same constituent word, creates a partially matching compound word co-occurrence table for each constituent word, and calculates, as the constituent word dominance, the degree of aggregation among the partially matching compound words in the co-occurrence vector space obtained from the compound word co-occurrence vectors of the partially matching compound word co-occurrence table;
a compound word composition distribution estimation unit that calculates, for each compound word, constituent word weighting coefficients among its constituent words using the constituent word dominances and creates a compound word composition distribution table compiling the constituent word weighting coefficients;
a base word co-occurrence vector extraction unit that selects an arbitrary word from the words as a base word and extracts, for each base word, a base word co-occurrence vector represented by the base word co-occurrence words regarded as co-occurring with the base word under an arbitrary base word co-occurrence determination rule based on the word information, and their co-occurrence counts;
a co-occurrence word concept estimation unit that, for each base word co-occurrence word that is a compound word among the co-occurrence words in the base word co-occurrence vectors of the base word co-occurrence table, treats each constituent word as a base word co-occurrence word, updates the base word co-occurrence vector using, as the co-occurrence count of each constituent word, the value obtained by multiplying the co-occurrence count of that base word co-occurrence word by the constituent word weighting coefficient of the constituent word in the compound word composition distribution table, and estimates the co-occurrence word concept of each base word co-occurrence word of the base word co-occurrence vector based on a predetermined concept estimation method using the general concept information;
a co-occurrence word classification unit that, for the base word co-occurrence words of the arbitrary base word, calculates the similarity between the corresponding co-occurrence word concepts with a predetermined similarity index and clusters the base word co-occurrence words based on the similarity index between the co-occurrence word concepts;
a polysemous word candidate estimation unit that extracts, as a polysemous word candidate, a base word for which there exist a plurality of clusters of base word co-occurrence words whose size is at or above an arbitrarily set threshold; and
a polysemous word candidate output unit that outputs the extracted polysemous word candidates.

The polysemous word extraction system according to claim 2 or 4, wherein the base word co-occurrence determination rule of the base word co-occurrence vector extraction unit is a rule that regards words in a dependency relation with the base word as co-occurrence words, or a rule that regards words used with a specific particle in the same sentence as the base word as co-occurrence words.

The polysemous word extraction system according to claim 2 or 4, wherein the concept database is a thesaurus that stores words with a classification system and from which synonym relations, near-synonym relations, hypernym/hyponym relations, and part/whole relations between words can be obtained as general concept information.

The polysemous word extraction system according to claim 6, wherein the concept estimation method of the co-occurrence word concept estimation unit queries the concept database for the general concept information of each base word co-occurrence word and takes, as the co-occurrence word concepts, a base word co-occurrence concept vector in which all base word co-occurrence words of a specific base word are replaced with the concepts of their general concept information, and the co-occurrence word classification unit performs clustering using, as the similarity index, the depth of classification down to which the base word co-occurrence words are regarded as belonging to the same general concept.

The polysemous word extraction system according to claim 6, wherein the concept estimation method of the co-occurrence word concept estimation unit: extracts, for each base word co-occurrence word, all peripheral words present around it under an arbitrary peripheral word determination rule, and creates a peripheral word composition table compiling, for each base word co-occurrence word, a peripheral word composition vector based on the numbers of occurrences of those peripheral words; queries the concept database for the general concept information of each peripheral word in the peripheral word composition vectors of the peripheral word composition table; creates, for each corresponding base word co-occurrence word, a peripheral word concept vector by converting each peripheral word of each peripheral word composition vector into a general concept; and takes, as the co-occurrence word concepts, a base word co-occurrence concept table compiling the peripheral word concept vectors for all base word co-occurrence words of a specific base word, and
the co-occurrence word classification unit calculates, for each hierarchy level, the distance between the peripheral word concept vectors corresponding to the base word co-occurrence words, and performs clustering using, as the similarity index, the value of a function that decreases monotonically with a distance weighted to emphasize the distances at the more detailed classification levels.

The system of claim 8, wherein the arbitrary peripheral word determination rule in the concept estimation method of the co-occurrence word concept estimation unit includes an algorithm that changes, for each part of speech, the range regarded as peripheral, for example verbs that co-occur within the same sentence and nouns within the sentences under the same table-of-contents item.

Range in which the compound word co-occurrence determination rule of the constituent word dominance calculation unit considers co-occurrence for each part of speech, such as a word having a dependency relationship if the part of speech is a verb, or a word in the same paragraph if the part of speech is a noun 10. The multi-word extraction system according to claim 4, further comprising an algorithm that extracts a compound word co-occurrence word and calculates a compound word co-occurrence number under different conditions.

A function in which the degree of aggregation between the partially matched compound words in the constituent word dominance calculating unit is a monotonously decreasing function as an index indicating the degree of dispersion between vectors corresponding to each partially matched compound word The multi-word extraction system according to any one of claims 4 to 10, which is calculated by:

The degree of aggregation between partially matched compound words in the constituent word dominance calculation unit is calculated based on a vector space weighted by the part of speech of a co-occurrence word. The ambiguous word extraction system according to item 1.

The compound word composition distribution estimation unit calculates a normalized weighting coefficient by dividing the component word dominance of each component word of the compound word by the sum of the component word dominance of each compound word. The multi-word extraction system according to any one of claims 4 to 12.

14. The polysemic word extraction system according to claim 1, wherein a weighting coefficient is assigned to each document or sentence group to be analyzed, and, for an arbitrary base word, clusters are formed using the co-occurrence word concepts estimated from the general concept for each base word co-occurrence word in order to determine whether the base word is a polysemic word candidate.

A polysemic word extraction method comprising:
extracting each word used in a given input sentence;
selecting an arbitrary word from the extracted words as a base word, and extracting a base word co-occurrence vector represented by the base word co-occurrence words regarded as being in a co-occurrence relation with the base word and by their co-occurrence counts;
estimating, from a general concept, the co-occurrence word concept of each base word co-occurrence word of the extracted base word co-occurrence vector;
clustering, for the estimated group of co-occurrence word concepts, the base word co-occurrence words of the selected base word based on the similarity between the corresponding co-occurrence word concepts; and
extracting the base word as a polysemic word candidate when a plurality of clusters exist for the selected base word.
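
The steps of this method could be sketched end to end as below. It is a toy illustration only: the `CONCEPTS` dictionary stands in for a general concept source, words are split on whitespace (Japanese text would need a morphological analyzer instead), co-occurrence is taken to mean "appears in the same sentence", and words are clustered simply by identical concept label. All of these are assumptions, and every name is hypothetical.

```python
from collections import Counter

# Hypothetical stand-in for a general concept dictionary: word -> concept label.
CONCEPTS = {
    "install": "software operation",
    "configure": "software operation",
    "visit": "business activity",
    "invoice": "business activity",
}

def cooccurrence_vector(sentences, base_word):
    """Count the words that appear in the same sentence as the base word."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        if base_word in words:
            counts.update(w for w in words if w != base_word)
    return counts

def cluster_by_concept(cooccurrence_words):
    """Group co-occurrence words whose estimated concepts are identical."""
    clusters = {}
    for word in cooccurrence_words:
        concept = CONCEPTS.get(word)
        if concept is not None:  # words without a known concept are skipped
            clusters.setdefault(concept, set()).add(word)
    return clusters

def is_polysemic_candidate(sentences, base_word):
    """A base word with two or more clusters is reported as a candidate."""
    vector = cooccurrence_vector(sentences, base_word)
    return len(cluster_by_concept(vector)) >= 2

SENTENCES = [
    "install client",
    "configure client",
    "visit client",
    "invoice client",
    "install server",
    "configure server",
]
```

In this sample, "client" co-occurs with words from two concept groups and is reported as a candidate, while "server" co-occurs only with software operations and is not.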

A polysemic word extraction method comprising:
extracting each word used in the sentences constituting a document or document group received from an input unit, together with word information on the part of speech and case of each word, the particles attached to it, and the dependency relations between words;
selecting an arbitrary word from among the words as a base word, and extracting, for each base word, a base word co-occurrence vector represented by the base word co-occurrence words regarded, under an arbitrary base word co-occurrence determination rule based on the word information, as being in a co-occurrence relation with the base word, and by their co-occurrence counts;
estimating the co-occurrence word concept of each base word co-occurrence word of the base word co-occurrence vector by a predetermined concept estimation method, using general concept information obtained as a response from a concept database that collects and accumulates general concept information organizing the general concepts of words, such as word concept classifications, synonyms, and usage, and that retrieves and returns general concept information on the meaning and usage of a word in response to a query about that word;
calculating, for the base word co-occurrence words of a specific base word, the similarity between the corresponding co-occurrence word concepts using a predetermined similarity index, and clustering the base word co-occurrence words based on that similarity index; and
extracting, as a polysemic word candidate, a base word for which there exist a plurality of clusters of its base word co-occurrence words whose cluster size is not less than a predetermined threshold.
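
The last two steps of this method, clustering by a similarity index and then requiring several clusters of at least a threshold size, might look like the following sketch. The Jaccard index over concept labels, the single-linkage grouping, the thresholds, and the sample `CONCEPT_SETS` are all assumptions chosen for illustration; the claim only requires "a predetermined similarity index" and "a predetermined threshold".

```python
# Hypothetical concept labels returned by a concept database for each co-occurrence word.
CONCEPT_SETS = {
    "install": {"software", "operation"},
    "configure": {"software", "operation"},
    "deploy": {"operation"},
    "invoice": {"commerce"},
    "contract": {"commerce", "document"},
    "lunch": {"food"},
}

def jaccard(word_a, word_b):
    """Jaccard index of the two words' concept label sets (assumed similarity index)."""
    a, b = CONCEPT_SETS[word_a], CONCEPT_SETS[word_b]
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cluster(words, similarity, min_similarity):
    """Single-linkage grouping: a word joins every cluster containing a similar word."""
    clusters = []
    for word in words:
        linked = [c for c in clusters
                  if any(similarity(word, other) >= min_similarity for other in c)]
        merged = {word}.union(*linked)
        clusters = [c for c in clusters if c not in linked] + [merged]
    return clusters

def is_candidate(clusters, min_cluster_size):
    """Candidate if at least two clusters reach the size threshold."""
    large = [c for c in clusters if len(c) >= min_cluster_size]
    return len(large) >= 2

clusters = cluster(list(CONCEPT_SETS), jaccard, min_similarity=0.5)
```

With these inputs the co-occurrence words fall into clusters of sizes 3, 2, and 1. A size threshold of 2 leaves two qualifying clusters, so the base word would be a candidate; a threshold of 3 leaves one, so it would not. The size threshold is what keeps a single stray co-occurrence word such as "lunch" from triggering a false candidate.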

A program that causes the control unit of an information processing device to operate as:
a word analysis unit that extracts each word used in a given input sentence;
a base word co-occurrence vector extraction unit that selects an arbitrary word from among the words as a base word, and extracts a base word co-occurrence vector represented by the base word co-occurrence words regarded as being in a co-occurrence relation with the base word and by their co-occurrence counts;
a co-occurrence word concept estimation unit that estimates, from a general concept, the co-occurrence word concept of each base word co-occurrence word of the base word co-occurrence vector;
a co-occurrence word classification unit that clusters, for the estimated group of co-occurrence word concepts, the base word co-occurrence words of the selected base word based on the similarity between the corresponding co-occurrence word concepts;
a polysemic word candidate estimation unit that extracts the base word as a polysemic word candidate when a plurality of clusters exist for the selected base word; and
a polysemic word candidate output unit that outputs the extracted polysemic word candidates.

A program that causes the control unit of an information processing device to operate as:
a document input unit that receives input of a target document or document group;
a word analysis unit that extracts each word used in the sentences constituting the document or document group, together with word information on the part of speech and case of each word, the particles attached to it, and the dependency relations between words;
a base word co-occurrence vector extraction unit that selects an arbitrary word from among the words as a base word, and extracts, for each base word, a base word co-occurrence vector represented by the base word co-occurrence words regarded, under an arbitrary base word co-occurrence determination rule based on the word information, as being in a co-occurrence relation with the base word, and by their co-occurrence counts;
a co-occurrence word concept estimation unit that estimates the co-occurrence word concept of each base word co-occurrence word of the base word co-occurrence vector by a predetermined concept estimation method, using general concept information obtained as a response from a concept database that collects and accumulates general concept information organizing the general concepts of words, such as word concept classifications, synonyms, and usage, and that retrieves and returns general concept information on the meaning and usage of a word in response to a query about that word;
a co-occurrence word classification unit that calculates, for the base word co-occurrence words of a specific base word, the similarity between the corresponding co-occurrence word concepts using a predetermined similarity index, and clusters the base word co-occurrence words based on that similarity index;
a polysemic word candidate estimation unit that extracts, as a polysemic word candidate, a base word for which there exist a plurality of clusters of its base word co-occurrence words whose size is equal to or larger than an arbitrarily determined threshold; and
a polysemic word candidate output unit that outputs the extracted polysemic word candidates.