JP2013020431A - Polysemic word extraction system, polysemic word extraction method and program - Google Patents

Polysemic word extraction system, polysemic word extraction method and program

Info

Publication number
JP2013020431A
Authority
JP
Japan
Prior art keywords
word
occurrence
basic
concept
words
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
JP2011152983A
Other languages
Japanese (ja)
Other versions
JP5754018B2 (en)
Inventor
Eiji Hirao
英司 平尾
Takeshi Furuhashi
武 古橋
Ohiro Yoshikawa
大弘 吉川
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nagoya University NUC
NEC Corp
Original Assignee
Nagoya University NUC
NEC Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nagoya University NUC, NEC Corp
Priority to JP2011152983A
Publication of JP2013020431A
Application granted
Publication of JP5754018B2
Legal status: Active
Anticipated expiration

Landscapes

  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

PROBLEM TO BE SOLVED: To provide a polysemic word extraction system that removes ambiguity from text by identifying polysemic words that are used with a meaning different from their general meaning in the document group of a specific project, such as proposals or specifications for information system construction.

SOLUTION: The polysemic word extraction system includes: a word analyzing section that extracts words from a given input text; a key word co-occurrence vector extraction section that selects a desired word as a key word and extracts a key word co-occurrence vector represented by the key word co-occurrence words, which are in a co-occurrence relation with the key word, and their numbers of co-occurrences; a co-occurrence word concept estimation section that estimates the co-occurrence word concept of each key word co-occurrence word of the key word co-occurrence vector on the basis of general concepts; a co-occurrence word classification section that clusters the key word co-occurrence words of the selected key word on the basis of the similarity among the corresponding co-occurrence word concepts in the estimated co-occurrence word concept group; a polysemic word candidate estimation section that extracts the key word as a polysemic word candidate when plural clusters are found; and a polysemic word candidate output section that outputs the extracted candidate.

Description

The present invention relates to a polysemic word extraction system, a polysemic word extraction method, and a program, and in particular to a system, method, and program for extracting polysemic words that are assigned a plurality of meanings within documents concerning a given project, such as proposals and specifications for information system construction.

In recent years, systems have been developed that use an information processing apparatus to analyze documents written in natural language and automatically extract their meaning and significance. In such systems, the handling of polysemic words in the text can become a problem.

One example of a technique related to polysemic word extraction is described in Patent Document 1 as a "word thesaurus construction system". The word thesaurus construction system disclosed in Patent Document 1 consists of a sentence analysis unit, an inter-noun distance calculation unit, a noun clustering unit, an ambiguity resolution unit, a re-clustering unit, a thesaurus generation unit, and a data storage unit, and operates as follows.
The sentence analysis unit performs morphological and syntactic analysis of the sentences in the corpus to be analyzed, generates verb case relation basic data, and generates a noun list, a verb list, and co-occurrence relation data. The inter-noun distance calculation unit obtains inter-noun distances from the generated lists and the co-occurrence relation data. The noun clustering unit generates noun clusters from the calculated inter-noun distances. The ambiguity resolution unit resolves the ambiguity of the verbs that co-occur with each noun on the basis of the tree structure of the noun clusters, and corrects the verb list and the co-occurrence relation data. The re-clustering unit performs noun clustering again on the basis of the corrected verb list and co-occurrence relation data. The thesaurus generation unit generates a word thesaurus from the re-clustering result. The data storage unit stores the corpus, which is the large body of text to be analyzed; the verb case relation basic data generated by analyzing the corpus; the verb list, which holds the verbs appearing in the text together with their frequencies; the noun list, which holds the nouns appearing in the text together with their frequencies; the co-occurrence relation data, which holds the co-occurrence relations between the verbs and nouns in the lists; the inter-noun distances obtained by the inter-noun distance calculation unit; the noun clusters generated by the clustering process; and the noun and verb thesauri generated by the thesaurus generation process. With this configuration, the ambiguity of verbs is judged from the distances between verbs and nouns in the document, the word lists and co-occurrence relation data are corrected according to that judgment, and the nouns are clustered again on the corrected data, which is said to allow a highly accurate thesaurus to be constructed.

Another example of a technique related to polysemic word extraction is described in Patent Document 2 as a "machine translation device". The machine translation device disclosed in Patent Document 2 consists of an input unit, an input character string storage unit, a translation dictionary unit, a dictionary search unit, a translation processing unit, a knowledge base unit, a word thesaurus unit, an ambiguity resolution unit, and a translation result output unit, and operates as follows.
The input unit receives a source-language character string, and the input character string storage unit stores it. The translation dictionary unit holds morphological information on source-language and target-language words, bilingual correspondence information between the source and target languages, and the like. The dictionary search unit searches the translation dictionary. The translation processing unit translates the source language into another language with reference to the translation dictionary unit and, when it identifies an ambiguity during translation, instructs the ambiguity resolution unit to resolve it. The knowledge base unit collects co-occurrence relations between words in the source language and the corresponding target-language expressions. The word thesaurus unit stores semantically similar words. The ambiguity resolution unit resolves ambiguities that arise when the input character string is translated into the target language: it first looks up the translation in the knowledge base; if none is found, it searches the knowledge base again with the source text replaced by semantically similar words from the word thesaurus unit; and if a translation still cannot be found, it chooses the translation by frequency. The translation result output unit outputs the translation result. With this configuration, when an ambiguity arises in a translated word, the word thesaurus supplements the size of the knowledge base, so that the ambiguity is resolved on the basis of an effectively larger knowledge base.

Patent Document 1: JP 2001-331515 A
Patent Document 2: JP 05-158970 A

The problem with the techniques described above is that, when the exemplified polysemic word extraction methods are applied to extracting polysemic words that are assigned a plurality of meanings within documents concerning a given project, such as proposals and specifications for information system construction, the extraction rate for polysemic words becomes low.

The reason is that most documents in which such polysemic words are used contain only a limited amount of text, so the same word is unlikely to appear repeatedly as a co-occurrence word of any given word, and it is difficult to cluster co-occurrence words in the way done by the method of Patent Document 1, which presupposes a large corpus. In other words, with the method of Patent Document 1, clustering co-occurrence words on the basis of a small corpus does not allow polysemic words to be extracted with the desired accuracy.

A problem from another viewpoint is that, when the exemplified polysemic word extraction methods are applied to extracting polysemic words that are assigned a plurality of meanings within documents concerning a given project, such as proposals and specifications for information system construction, polysemic words that hold only within the document group of a specific project cannot be extracted.

The reason is that the synonym relations of such polysemic words are difficult to grasp in advance, so it is difficult to determine, with a translation dictionary such as the one used in the method of Patent Document 2, whether the text contains places that need distinct translations because of polysemy. Countermeasures such as preparing a thesaurus for these special polysemic words separately from existing dictionaries therefore become necessary, but preparing such a thesaurus imposes a heavy burden.

The problem to be addressed is therefore to extract, as required, the polysemic words from documents that contain distinctive polysemic words used within a specific scope.

In view of the above, an object of the present invention is to provide a polysemic word extraction system, method, and program that extract polysemic words assigned a plurality of meanings within documents concerning a given project, such as proposals and specifications for information system construction.

A polysemic word extraction system according to the present invention comprises: a word analysis unit that extracts each word used in given input text; a base word co-occurrence vector extraction unit that selects an arbitrary one of the words as a base word (the "key word" of the abstract) and extracts a base word co-occurrence vector represented by the base word co-occurrence words, which are regarded as being in a co-occurrence relation with the base word, and their co-occurrence counts; a co-occurrence word concept estimation unit that estimates, from general concepts, the co-occurrence word concept of each base word co-occurrence word of the base word co-occurrence vector; a co-occurrence word classification unit that clusters the base word co-occurrence words of the selected base word on the basis of the similarity between the corresponding co-occurrence word concepts in the estimated co-occurrence word concept group; a polysemic word candidate estimation unit that extracts the base word as a polysemic word candidate when a plurality of clusters exist for the selected base word; and a polysemic word candidate output unit that outputs the extracted polysemic word candidates.
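The final decision step described above (a word becomes a candidate when its co-occurrence words fall into more than one cluster) can be illustrated with a minimal Python sketch. The function name `polysemy_candidates`, the `min_clusters` parameter, and the sample data are hypothetical, and the clustering is replaced by a deliberately crude stand-in: co-occurrence words that share the same concept label form one cluster. The source only says that clustering is based on concept similarity, so this is an assumption for illustration, not the claimed method.

```python
from collections import defaultdict

def polysemy_candidates(concepts_by_base, min_clusters=2):
    """Return the words whose co-occurrence words form several clusters.

    concepts_by_base: word -> {co-occurrence word: concept label}.
    Stand-in clustering (an assumption): co-occurrence words that share
    a concept label belong to the same cluster.
    """
    candidates = {}
    for base, concept_of in concepts_by_base.items():
        clusters = defaultdict(list)
        for word, concept in concept_of.items():
            clusters[concept].append(word)
        # More than one cluster suggests more than one meaning in use.
        if len(clusters) >= min_clusters:
            candidates[base] = dict(clusters)
    return candidates

# Hypothetical data: "material" is ordered like goods and printed like a document.
sample = {
    "material": {"order": "trade", "purchase": "trade",
                 "print": "document", "distribute": "document"},
    "invoice": {"send": "transfer", "receive": "transfer"},
}
found = polysemy_candidates(sample)
print(found)
```

In this toy input only "material" is reported, because its co-occurrence words split into two groups while those of "invoice" stay in one.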

According to the present invention, it is possible to provide a polysemic word extraction system, method, and program that extract polysemic words assigned a plurality of meanings within documents concerning a given project, such as proposals and specifications for information system construction.

A block diagram showing the configuration of the polysemic word extraction system according to the first embodiment.
A flowchart showing an example of the operation of the polysemic word extraction system shown in FIG. 1.
A block diagram showing the configuration of the polysemic word extraction system according to the second embodiment.
A flowchart showing an example of the operation of the polysemic word extraction system shown in FIG. 3.
A block diagram showing the configuration of the polysemic word extraction system according to the first example.
An explanatory diagram showing an example of a table summarizing the base word co-occurrence vectors Ni.
An explanatory diagram showing an example of the classification system of the thesaurus general concept information Cg for the base word co-occurrence words Vij.
An example in which the co-occurrence word concept diagram Cvwj for the base word "material" is drawn as a tree diagram.
An explanatory diagram showing an example of the peripheral word composition table VV.
An explanatory diagram showing an example of the classification system of the thesaurus general concept information Cg for the peripheral words Vvwjf.
An explanatory diagram showing the major-classification co-occurrence word concept table VC1 based on the peripheral words Vvwjf of the co-occurrence words of the base word "material".
An explanatory diagram showing the middle-classification co-occurrence word concept table VC2 based on the peripheral words Vvwjf of the co-occurrence words of the base word "material".
An explanatory diagram showing the minor-classification co-occurrence word concept table VC3 based on the peripheral words Vvwjf of the co-occurrence words of the base word "material".
An explanatory diagram showing an example of a clustering result based on the tree diagram of the co-occurrence word concept diagram Cvwj for the base word "material".
An explanatory diagram showing an example of a clustering result based on the dendrogram of the co-occurrence word concept diagram Cvwj for the base word "material".
A block diagram showing the configuration of the polysemic word extraction system according to the second example.
An explanatory diagram showing an example of the partial-match compound word co-occurrence table VUx containing the constituent word "process".
An explanatory diagram showing an example of the partial-match compound word co-occurrence table VUx containing the constituent word "change".
An explanatory diagram showing an example of the compound word composition distribution table Te for the compound word "change process".
An explanatory diagram showing an example of a table summarizing the base word co-occurrence vectors Ni with compound words taken into account.
An explanatory diagram showing an example of the classification system of the thesaurus general concept information Cg for the base word co-occurrence words Vij with compound words taken into account.
An example in which the co-occurrence word concept diagram Cvwj for the base word "material", with compound words taken into account, is drawn as a tree diagram.
An explanatory diagram showing an example of a clustering result based on the tree diagram of the co-occurrence word concept diagram Cvwj for the base word "material", with compound words taken into account.

[Embodiment 1]
First, a first embodiment of the present invention will be described in detail with reference to the drawings.
FIG. 1 is a block diagram showing the configuration of a polysemic word extraction system 100 according to the first embodiment of the present invention.

Referring to FIG. 1, the polysemic word extraction system 100 according to the first embodiment of the present invention is implemented basically within an electronic device, or within a system made up of a server, electronic devices, and an information communication network such as the Internet that interconnects them. It includes at least a document input unit 10, a word analysis unit 20, a base word co-occurrence vector extraction unit 30, a co-occurrence word concept estimation unit 40, a co-occurrence word classification unit 50, a polysemic word candidate estimation unit 60, a polysemic word candidate output unit 70, and a concept database 110.
The illustrated polysemic word extraction system 100 extracts polysemic words that are assigned a plurality of meanings within documents concerning a given project, such as proposals and specifications for information system construction.

When the polysemic word extraction system is implemented in an electronic device, the polysemic word extraction system 100 can be realized by a computer that operates under program control. Although not illustrated, such a computer, as is well known, includes an input device for entering data, a data processing device, an output device that outputs the results of processing by the data processing device, and an auxiliary storage device that serves as various databases. The data processing device consists of a read-only memory (ROM) that stores a program, a random access memory (RAM) used as a work area for temporarily storing data, and a central processing unit (CPU) that processes the data stored in the RAM in accordance with the program stored in the ROM.
In this case, the data processing device functions as the document input unit 10, the word analysis unit 20, the base word co-occurrence vector extraction unit 30, the co-occurrence word concept estimation unit 40, the co-occurrence word classification unit 50, and the polysemic word candidate estimation unit 60; the auxiliary storage device operates as the concept database 110; and the output device functions as the polysemic word candidate output unit 70.

Next, the operation of each component of the polysemic word extraction system 100 will be described.

The document input unit 10 accepts input of the document or document group from which polysemic words are to be extracted.

The word analysis unit 20 applies morphological analysis, syntactic analysis, and the like to each sentence of the document or document group, and extracts as words the independent words that carry meaning on their own, such as nouns, verbs, adjectives, and adjectival verbs. As needed, it also extracts word information such as the part of speech of each word, the type of particle used immediately after it, and the dependency relations between words. Morphemes may be used as they are instead of independent words.

The base word co-occurrence vector extraction unit 30 sequentially selects each word used in the sentences, as extracted by the word analysis unit 20, as a base word. Using the word information for each base word and an arbitrary base word co-occurrence determination rule, it extracts for each base word a base word co-occurrence vector represented by the base word co-occurrence words, which are regarded as co-occurring with the base word, and their co-occurrence counts. Possible base word co-occurrence determination rules include a rule that regards words in a dependency relation with the base word as co-occurrence words, and a rule that regards words used with a specific particle in the same sentence as the base word as co-occurrence words. The co-occurrence count may be the raw number of co-occurrences, or a frequency obtained by dividing the number of co-occurrences by the total number of co-occurrence words for the base word. The base word co-occurrence words and their counts may also be weighted according to the importance or reliability of the source document, the parent-child relations between documents, and the like.
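As a rough illustration of this step, the following Python sketch builds one co-occurrence vector per word. It assumes the simplest possible determination rule, namely that two distinct words in the same sentence co-occur; the dependency-based and particle-based rules mentioned above would need a parser and are not shown. The function name `cooccurrence_vectors`, the `normalize` flag, and the sample sentences are hypothetical, and the frequency variant divides by the summed co-occurrence count, which is one possible reading of "total number of co-occurrence words".

```python
from collections import Counter, defaultdict

def cooccurrence_vectors(sentences, normalize=False):
    """Build a co-occurrence vector for every word, treated in turn as base word.

    sentences: list of word lists (already reduced to independent words).
    Assumed rule: two distinct words in the same sentence co-occur.
    """
    vectors = defaultdict(Counter)
    for words in sentences:
        for i, base in enumerate(words):
            for j, other in enumerate(words):
                if i != j and other != base:
                    vectors[base][other] += 1
    if not normalize:
        return {base: dict(vec) for base, vec in vectors.items()}
    # Frequency variant: divide each count by the base word's total count.
    result = {}
    for base, vec in vectors.items():
        total = sum(vec.values())
        result[base] = {word: count / total for word, count in vec.items()}
    return result

# Hypothetical pre-tokenized sentences.
sentences = [
    ["material", "order", "supplier"],
    ["material", "print", "meeting"],
    ["material", "order", "delivery"],
]
counts = cooccurrence_vectors(sentences)
freqs = cooccurrence_vectors(sentences, normalize=True)
print(counts["material"])
```

Document-level weighting could be added by multiplying each increment by a per-document weight, which is omitted here to keep the sketch short.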

The concept database 110 accumulates general concept information, such as the concept classifications of collected words and their general synonyms, near-synonyms, and usages, and responds to a query about a specific word by retrieving the general concept information related to that word's meaning and usage. A thesaurus in which words are classified and organized according to hypernym/hyponym, part/whole, synonym, and near-synonym relations is one example of the concept database 110. A database on the Internet may also be used as the concept database 110.

The co-occurrence word concept estimation unit 40 uses the general concept information in the concept database 110 to estimate, by a predetermined concept estimation method, the co-occurrence word concept of each base word co-occurrence word in the base word co-occurrence vector.

One suitable concept estimation method is to query the concept database 110 directly for the general concept information of each base word co-occurrence word, and to take as the co-occurrence word concept a base word co-occurrence concept vector in which all base word co-occurrence words of a given base word are replaced by general concepts based on that information. When different base word co-occurrence words map to the same general concept, they are merged and the sum of their co-occurrence counts is registered at the corresponding position. When the concept database 110 is a thesaurus in which concepts at multiple levels, such as major, middle, and minor classifications, are registered as general concept information, a base word co-occurrence concept vector is created for each level; when different co-occurrence words map to the same concept in the vector for a broad level such as the major classification, they are merged and the sum of their co-occurrence counts is registered at the corresponding position.
Alternatively, when the concept database 110 is a synonym dictionary in which groups of near-synonyms, including synonyms, are registered as general concept information, each base word co-occurrence word may be converted into every near-synonym of its group, with each near-synonym assigned the co-occurrence count of the corresponding co-occurrence word, and the totals per near-synonym over the co-occurrence words of the same base word may be calculated as the base word co-occurrence concept vector. If the concept database 110 has no concept corresponding to a base word co-occurrence word, that word is not converted and is kept as a concept as it is.
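The replacement and merging rules above can be sketched in Python as follows. The tiny `THESAURUS` and `SYNONYMS` tables, the function names, and the sample counts are all hypothetical stand-ins for the concept database 110; the sketch only shows the bookkeeping (concept substitution, summing merged counts, per-level vectors, and keeping unknown words unchanged), not any particular thesaurus.

```python
from collections import Counter

# Hypothetical toy thesaurus: word -> (major, middle, minor) concept labels.
THESAURUS = {
    "order": ("activity", "trade", "purchase"),
    "purchase": ("activity", "trade", "purchase"),
    "print": ("activity", "production", "printing"),
}

# Hypothetical toy synonym dictionary: word -> near-synonym group.
SYNONYMS = {
    "order": ["order", "request"],
    "purchase": ["purchase", "order"],
}

def concept_vector(cooc, level):
    """Replace co-occurrence words by the concept at one level (0 = major)."""
    merged = Counter()
    for word, count in cooc.items():
        labels = THESAURUS.get(word)
        # A word without a registered concept is kept as its own concept.
        concept = labels[level] if labels else word
        merged[concept] += count  # words sharing a concept are summed
    return dict(merged)

def synonym_vector(cooc):
    """Spread each word's count over every near-synonym in its group."""
    merged = Counter()
    for word, count in cooc.items():
        for synonym in SYNONYMS.get(word, [word]):
            merged[synonym] += count
    return dict(merged)

cooc = {"order": 2, "purchase": 1, "print": 3, "kickoff": 1}
by_level = [concept_vector(cooc, level) for level in range(3)]
by_synonym = synonym_vector({"order": 2, "purchase": 1})
print(by_level[0], by_synonym)
```

At the major level "order", "purchase", and "print" collapse into a single concept, while at the minor level "print" stays separate, which is why one vector per level is kept.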

As another example of the concept estimation method, the following may be used. For each basic word co-occurrence word, a peripheral word composition vector is built from the peripheral words found around that co-occurrence word under an arbitrary peripheral word determination rule, together with their occurrence counts, and the vectors for all basic word co-occurrence words are collected into a peripheral word composition table. The concept database 110 is then queried for general concept information on each peripheral word in each peripheral word composition vector of the table. Within an arbitrary range, each peripheral word is converted into a general concept, yielding a peripheral word concept vector for each corresponding basic word co-occurrence word. The basic word co-occurrence concept table that collects the peripheral word concept vectors for all basic word co-occurrence words of a specific basic word is then used as the co-occurrence word concept.
Here, the peripheral word determination rule may set the range regarded as the periphery according to the characteristics of the document: one sentence, all sentences within one paragraph, all sentences under the same item in the table of contents, the whole document, and so on. The range may also differ by part of speech, for example verbs coexisting within one sentence and nouns within the sentences under the same table-of-contents item. Whether a word is in a dependency relationship may also be used as the peripheral word determination rule. The occurrence count may be a raw count, or a frequency obtained by dividing the raw count by the total number of peripheral words for each basic word co-occurrence word. The peripheral word composition table is a matrix in which each row corresponds to a basic word co-occurrence word, each column corresponds to a peripheral word, and each value is the occurrence count of that peripheral word for that co-occurrence word. When the conversion maps different peripheral words to the same concept, those peripheral words are merged and the sum of their occurrence counts is registered in the corresponding cell. When the concept database 110 is a thesaurus in which concepts at multiple levels (such as major, middle, and minor classifications) are registered as general concept information, a basic word co-occurrence concept table is created for each level; where different peripheral words fall under the same concept in the table for a broad level such as the major classification, they are merged and the sum of their occurrence counts is registered in the corresponding cell. Alternatively, when the concept database 110 is a synonym dictionary in which groups of near-synonyms (including synonyms) are registered as general concept information, each peripheral word is converted into every member of its corresponding group, each member is assigned the occurrence count of the original peripheral word, and the total count per synonym over the peripheral words of the same basic word co-occurrence word is calculated as the peripheral word concept vector. The basic word co-occurrence concept table may then be created by collecting the peripheral word concept vectors for all basic word co-occurrence words of a specific basic word. If the concept database 110 has no concept corresponding to a peripheral word, that word is not converted into a concept and is kept as it is as a provisional concept.
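The conversion and merging rule described above can be illustrated with a minimal sketch. The function name `to_concept_vector`, the dictionary-based stand-in for the concept database 110, and the sample words are all assumptions made for illustration; the source does not prescribe any data structure.

```python
from collections import defaultdict

def to_concept_vector(peripheral_counts, concept_of):
    """Convert a peripheral word composition vector into a concept vector.

    peripheral_counts: {peripheral word: occurrence count}
    concept_of: hypothetical stand-in for the concept database,
                mapping a word to one general concept.
    """
    concept_counts = defaultdict(float)
    for word, count in peripheral_counts.items():
        # A word with no registered concept is kept as a provisional concept.
        concept = concept_of.get(word, word)
        # Different words that map to the same concept are merged by summing.
        concept_counts[concept] += count
    return dict(concept_counts)

# Hypothetical thesaurus fragment and counts, for illustration only.
concept_of = {"server": "hardware", "router": "hardware", "invoice": "document"}
peripheral_counts = {"server": 3, "router": 2, "invoice": 1, "FooNet": 4}
concept_vector = to_concept_vector(peripheral_counts, concept_of)
print(concept_vector)
```

Here "server" and "router" collapse into one concept whose count is the sum of both, while the unregistered word "FooNet" survives unchanged as a provisional concept.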

The co-occurrence word classification unit 50 calculates, using a predetermined similarity index, the similarity between the co-occurrence word concepts of the basic word co-occurrence words of a specific basic word, and clusters the basic word co-occurrence words based on that similarity index. The "similarity index" may be any criterion for judging the semantic similarity between co-occurrence word concepts. For example, suppose the co-occurrence word concept is a basic word co-occurrence concept vector in which all basic word co-occurrence words of a specific basic word have been replaced with general concepts based on the general concept information. When the thesaurus is used as the concept database 110, the classification depth at which the basic word co-occurrence words come to be regarded as the same general concept is effective as the similarity index. When the synonym dictionary is used as the concept database 110, a suitable similarity index is a function value that decreases monotonically with a distance, such as the cosine distance or the Euclidean distance, between basic word co-occurrence concept vectors built from the total co-occurrence counts per synonym. Alternatively, suppose the co-occurrence word concept is the basic word co-occurrence concept table that collects the peripheral word concept vectors for all basic word co-occurrence words of a specific basic word. When the thesaurus is used as the concept database 110, the cosine distance, Euclidean distance, or the like between the peripheral word concept vectors of the basic word co-occurrence words is calculated for each level, the distances are weighted so that those at deeper, more detailed levels such as the minor classification count more, and a function value that decreases monotonically with the weighted distance is appropriate as the similarity index. The clustering method may be a general one. Hierarchical clustering using a dendrogram or the like may be applied, or non-hierarchical clustering such as the k-means method or the fuzzy c-means method may be applied to virtual positions of the peripheral words derived by treating, as a distance, an index that decreases monotonically with the similarity between peripheral words.
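As a rough sketch of one possible similarity index and clustering step, the code below uses the cosine distance between concept vectors, takes 1/(1+d) as a function that decreases monotonically with distance, and groups words whose similarity reaches a threshold. The grouping is equivalent to cutting a single-linkage dendrogram at a fixed level; this choice, the function names, the threshold, and the sample vectors are all assumptions and not the method fixed by the source.

```python
import math

def cosine_distance(u, v):
    """Cosine distance between two sparse vectors given as dicts."""
    dot = sum(u.get(k, 0.0) * v.get(k, 0.0) for k in set(u) | set(v))
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0  # assumption: an empty vector is maximally distant
    return 1.0 - dot / (norm_u * norm_v)

def similarity(u, v):
    # One monotonically decreasing function of distance among many possible.
    return 1.0 / (1.0 + cosine_distance(u, v))

def cluster(vectors, min_similarity):
    """Group words linked by similarity >= min_similarity (single linkage)."""
    names = list(vectors)
    parent = {n: n for n in names}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if similarity(vectors[a], vectors[b]) >= min_similarity:
                parent[find(a)] = find(b)
    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return sorted(sorted(g) for g in groups.values())

# Hypothetical concept vectors of co-occurrence words of one basic word.
vectors = {
    "connect": {"hardware": 3, "network": 2},
    "install": {"hardware": 2, "network": 1},
    "depart": {"transport": 4},
    "arrive": {"transport": 2, "place": 1},
}
clusters = cluster(vectors, min_similarity=0.8)
print(clusters)
```

A library implementation of hierarchical or k-means clustering could be substituted for the small union-find grouping used here.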

For each word taken as a basic word, the polysemic word candidate estimation unit 60 examines the clustering result of its basic word co-occurrence words. A basic word that has a plurality of clusters whose size is at or above an arbitrarily set threshold is regarded as showing several semantically distinct usages and is extracted as a polysemic word candidate, that is, a word that may be polysemic. The size of a cluster may be measured by, for example, the co-occurrence counts of the basic word co-occurrence words belonging to the cluster.
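The candidate test described above can be sketched as follows, assuming that cluster size is the sum of the co-occurrence counts of the member words. The function name, the threshold values, and the sample data are hypothetical.

```python
def is_polysemic_candidate(clusters, cooccurrence_counts, min_cluster_size):
    """Return True if at least two clusters reach the size threshold.

    clusters: list of lists of co-occurrence words for one basic word
    cooccurrence_counts: {co-occurrence word: co-occurrence count}
    """
    sizes = [sum(cooccurrence_counts[w] for w in c) for c in clusters]
    large = [s for s in sizes if s >= min_cluster_size]
    return len(large) >= 2

# Hypothetical clustering result for one basic word.
clusters = [["connect", "install"], ["depart", "arrive"], ["misc"]]
counts = {"connect": 5, "install": 3, "depart": 4, "arrive": 2, "misc": 1}

print(is_polysemic_candidate(clusters, counts, min_cluster_size=5))
print(is_polysemic_candidate(clusters, counts, min_cluster_size=7))
```

With a threshold of 5, two clusters (sizes 8 and 6) qualify and the basic word becomes a candidate; with a threshold of 7, only one cluster qualifies and it does not.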

The polysemic word candidate output unit 70 outputs the polysemic word candidates extracted by the polysemic word candidate estimation unit 60. The output may take any required form. A suitable form is to output the entire document with the basic words that are polysemic word candidates marked by color coding, bold emphasis, or the like. Another form is a table of the extracted combinations of polysemic word candidates. Yet another is a graph in which the basic word identified as a polysemic word candidate is the main node, each cluster based on the concepts of its basic word co-occurrence words is an intermediate node, and the basic word co-occurrence words belonging to each cluster are end nodes, all connected by links, with links that have high co-occurrence counts color-coded for emphasis. A quantitative degree of polysemy may also be attached between polysemic words using, for example, the similarity index used in extracting the candidates, and the display may be limited to polysemic words whose degree of polysemy exceeds an arbitrarily set threshold. Alternatively, the degree of polysemy between candidates may determine the color coding, the bold emphasis, or the size of the word labels in the graph. The output forms may be made selectable so that the display can move from a base form to a table or a graph as needed. Verbs, nouns, and so on may also be output selectively as needed.
Next, the overall operation of the polysemic word extraction system 100 according to the first embodiment will be described in detail with reference to FIG. 1 and the sequence shown in FIG. 2. The flowchart in FIG. 2 and the following description are one processing example; depending on the desired effect, the order of the steps may be changed, and steps may be revisited or repeated.

The document input unit 10 receives the input of a target document or document group (step A1 in FIG. 2).
The word analysis unit 20 applies morphological analysis, syntactic analysis, and the like to each sentence of the document or document group, and extracts as words the independent words that carry meaning on their own, such as nouns, verbs, adjectives, and adjectival verbs. It also extracts word information such as the part of speech of each word, the type of particle used immediately after it, and the dependency relationships between words (step A2).

The basic word co-occurrence vector extraction unit 30 selects an arbitrary word used in the sentences extracted by the word analysis unit 20 as a basic word. Based on the word information for each basic word, it extracts a basic word co-occurrence vector consisting of the basic word co-occurrence words (the words regarded as co-occurring with the basic word under a predetermined basic word co-occurrence determination rule) and their co-occurrence counts (step A3).
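Step A3 can be sketched as below, assuming the simplest co-occurrence determination rule (appearing in the same sentence) and assuming the word analysis step has already produced a list of independent words per sentence. The function name and the sample sentences are hypothetical.

```python
from collections import Counter

def cooccurrence_vector(basic_word, sentences):
    """Count words that share a sentence with the basic word.

    sentences: list of word lists, one list per sentence.
    """
    vector = Counter()
    for words in sentences:
        if basic_word in words:
            # Every other word in the sentence is a co-occurrence word.
            vector.update(w for w in words if w != basic_word)
    return vector

# Hypothetical tokenized sentences.
sentences = [
    ["terminal", "connect", "network"],
    ["bus", "terminal", "depart"],
    ["network", "install"],
    ["terminal", "connect", "server"],
]
vector = cooccurrence_vector("terminal", sentences)
print(vector.most_common())
```

Other rules mentioned in the description, such as paragraph-level ranges or dependency relationships, would change only how the candidate words for each occurrence are gathered.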

In response to a query about a specific word, the concept database 110 searches its collected and accumulated general concept information (concept classifications of words, general synonyms, near-synonyms, usages, and so on) and returns the general concept information related to the meaning and usage of that word (step A4).

The co-occurrence word concept estimation unit 40 uses the general concept information in the concept database 110 to estimate, based on a predetermined concept estimation method, an individual co-occurrence word concept for each basic word co-occurrence word of the basic word co-occurrence vector (step A5).

For the basic word co-occurrence words of a specific basic word, the co-occurrence word classification unit 50 refers to the estimated co-occurrence word concepts, calculates the similarity between them using a predetermined similarity index, and clusters the basic word co-occurrence words based on that similarity index (step A6).

From the clustering result of the basic word co-occurrence words of each basic word, the polysemic word candidate estimation unit 60 sequentially extracts, as polysemic word candidates, the basic words that have a plurality of clusters whose size is at or above an arbitrarily set threshold. Such words show several semantically distinct usages and may be polysemic (step A7).

The polysemic word candidate output unit 70 outputs the polysemic word candidates extracted by the polysemic word candidate estimation unit 60 (step A8).

Next, the effect of the polysemic word extraction system 100 according to the first embodiment of the present invention will be described.
In the first embodiment, the basic word co-occurrence words in a document or document group are converted into co-occurrence word concepts, and polysemic word candidates are extracted from the result of clustering co-occurrence words that are semantically similar but do not match as words. Consequently, even with a small amount of text, where each basic word co-occurrence word appears only a few times and the distances between basic word co-occurrence words tend to be zero, it is possible to determine whether a basic word has multiple usage patterns, and polysemic words that are assigned multiple meanings within the documents of a given project can be extracted accurately.
The polysemic word extraction system 100 according to the first embodiment can also be realized as a polysemic word extraction method. It may also be implemented by having a computer execute a polysemic word extraction program.

[Embodiment 2]
Next, a second embodiment will be described in detail with reference to the drawings.
FIG. 3 is a block diagram showing the configuration of a polysemic word extraction system 100A according to the second embodiment.

Referring to FIG. 3, the polysemic word extraction system 100A according to the second embodiment has the same configuration and operates in the same way as the polysemic word extraction system 100 according to the first embodiment shown in FIG. 1, except that it further includes a constituent word dominance calculation unit 35 and a compound word composition distribution estimation unit 36, and that the word analysis unit and the co-occurrence word concept estimation unit operate differently, as described later. Accordingly, the word analysis unit is given the reference sign 20A and the co-occurrence word concept estimation unit is given the reference sign 40A.

When the illustrated polysemic word extraction system 100A is realized by the computer described above, the data processing device serves as the document input unit 10, the word analysis unit 20A, the basic word co-occurrence vector extraction unit 30, the constituent word dominance calculation unit 35, the compound word composition distribution estimation unit 36, the co-occurrence word concept estimation unit 40A, the co-occurrence word classification unit 50, and the polysemic word candidate estimation unit 60; the auxiliary storage device operates as the concept database 110; and the output device serves as the polysemic word candidate output unit 70.
The word analysis unit 20A obtains the compound words among the words in the document and the constituent words of those compound words. The constituent word dominance calculation unit 35 calculates a constituent word dominance for each constituent word of a compound word. The compound word composition distribution estimation unit 36 creates a compound word composition distribution table in which the concept of each constituent word of a compound word is weighted based on the constituent word dominance. Before converting basic word co-occurrence words into concepts, the co-occurrence word concept estimation unit 40A converts the co-occurrence count of each basic word co-occurrence word that is a compound word in the basic word co-occurrence vector into co-occurrence counts distributed according to the compound word composition distribution table.

Next, the operation of each component of the polysemic word extraction system 100A will be described.

In addition to the operation of the word analysis unit 20 shown in FIG. 1, the word analysis unit 20A queries the concept database 110 for the general concept information of each extracted word and extracts, as a compound word, any word that is not registered in the concept database 110 and is two or more characters long. The word analysis unit 20A further queries the concept database 110 for general concept information on every substring of the compound word and extracts each substring that has registered general concept information as a significant constituent word of the compound word. If separating the extracted significant constituent words from the original compound word leaves a substring with no registered general concept information, that substring is extracted as an unknown constituent word. In these respects the word analysis unit 20A differs from the word analysis unit 20 shown in FIG. 1.
When several combination patterns of registered substrings are possible for a compound word, the optimal pattern is determined by an arbitrary constituent word separation rule, and the significant and unknown constituent words of that pattern are extracted. Effective constituent word separation rules include a rule that prefers the pattern with the fewest characters in unknown constituent words, a rule that prefers significant constituent words that appear frequently as stand-alone words in the input document, a rule that prefers significant constituent words that appear frequently as stand-alone words in general documents, and combinations of these. In addition, when a character string shared with other compound words in the input document is used at or above a predetermined frequency, a rule may be used that prefers, as significant constituent words, the character strings that remain after that string is removed.
The general concept information may be, for example, a classification in a thesaurus, keywords that directly express the meaning of a word, or a set of near-synonyms.
In the following, the term "constituent word" on its own covers both significant constituent words and unknown constituent words.
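The constituent word separation described above can be sketched with one of the named rules, namely preferring the pattern that leaves the fewest characters in unknown constituent words. The dynamic-programming formulation, the tie-break on the number of segments, the function name, and the sample vocabulary are assumptions for illustration; a set of known strings stands in for lookups against the concept database 110.

```python
def split_compound(compound, known_words):
    """Split a compound into significant and unknown constituent words.

    Chooses the segmentation with the fewest unknown characters and,
    among ties, the fewest segments (a tie-break assumed here).
    """
    n = len(compound)
    # best[i] = (unknown characters, segment count, segments) for compound[:i]
    best = [(0, 0, [])]
    for end in range(1, n + 1):
        candidates = []
        for start in range(end):
            piece = compound[start:end]
            unknown, count, segments = best[start]
            is_known = piece in known_words
            cost = 0 if is_known else len(piece)
            candidates.append(
                (unknown + cost, count + 1, segments + [(piece, is_known)])
            )
        best.append(min(candidates, key=lambda c: (c[0], c[1])))
    segments = best[n][2]
    significant = [p for p, known in segments if known]
    unknown = [p for p, known in segments if not known]
    return significant, unknown

# Hypothetical vocabulary registered in the concept database.
known_words = {"data", "center", "cent"}
result = split_compound("datacenterx", known_words)
print(result)
```

In this example the pattern "data" + "center" + "x" wins over "data" + "cent" + "erx" because it leaves only one unknown character.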

Based on the words and compound words used in the sentences extracted by the word analysis unit 20A, the constituent word dominance calculation unit 35 treats the words that co-occur with a compound word under an arbitrary compound word co-occurrence determination rule as compound word co-occurrence words, extracts the compound word co-occurrence words and their co-occurrence counts for each compound word, and collects them into a compound word co-occurrence table.
The compound word co-occurrence determination rule may be chosen according to the characteristics of the document: one sentence, all sentences within one paragraph, all sentences under the same item in the table of contents, the whole document, the document title, the position within the document group, and so on. For example, the range may be varied by part of speech, such as co-occurrence within one sentence for verbs and co-occurrence within all sentences under the same table-of-contents item for nouns.
The co-occurrence count may be the raw number of co-occurrences, or a frequency obtained by dividing that number by the total number of co-occurrence words for each compound word.
Further, when the word information includes dependency relationships between words, whether a word is in a dependency relationship may be used as the compound word co-occurrence determination rule.
The compound word co-occurrence table is a matrix in which each row corresponds to a compound word, each column corresponds to a compound word co-occurrence word, and each value is the co-occurrence count of that co-occurrence word for that compound word.

Further, based on the compound word co-occurrence table and the constituent words extracted by the word analysis unit 20A, the constituent word dominance calculation unit 35 extracts from the compound word co-occurrence table the compound word co-occurrence vectors of the partially matching compound words, that is, the compound words that contain the same constituent word, and creates a partially matching compound word co-occurrence table for each constituent word. It then calculates, as the constituent word dominance, the degree of aggregation among the partially matching compound words in the co-occurrence vector space obtained from the compound word co-occurrence vectors of that table.
Here, the co-occurrence vector space may treat each vector equally, or may be converted into a vector space weighted by the part of speech of the compound word co-occurrence words. The degree of aggregation among the partially matching compound words may be calculated by any method, as long as it is an index of how small the scatter is among the vectors corresponding to those compound words. For example, any function that decreases monotonically with a measure of variability commonly used in statistics, such as the variance, the standard deviation, or the coefficient of variation, will do; the reciprocal of the variance or the reciprocal of the coefficient of variation is suitable.
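One way to realize the reciprocal-of-variance option is sketched below. Variance is taken here as the mean squared Euclidean distance of the co-occurrence vectors from their centroid, and a small `epsilon` guards against division by zero; both choices, the function name, and the sample vectors are assumptions rather than details given in the source.

```python
def dominance(vectors, epsilon=1e-9):
    """Reciprocal of the variance of co-occurrence vectors (dicts).

    vectors: co-occurrence vectors of the compound words that share
             one constituent word.
    """
    dims = sorted({k for v in vectors for k in v})
    n = len(vectors)
    centroid = {d: sum(v.get(d, 0.0) for v in vectors) / n for d in dims}
    # Mean squared Euclidean distance from the centroid.
    variance = sum(
        (v.get(d, 0.0) - centroid[d]) ** 2 for v in vectors for d in dims
    ) / n
    return 1.0 / (variance + epsilon)

# Hypothetical vectors of compounds sharing the constituent "center".
center_vectors = [
    {"operate": 2, "staff": 1},  # e.g. "datacenter"
    {"operate": 2, "staff": 2},  # e.g. "callcenter"
]
# Hypothetical vectors of compounds sharing the constituent "data".
data_vectors = [
    {"operate": 2, "staff": 1},  # e.g. "datacenter"
    {"restore": 3},              # e.g. "databackup"
]
center_dominance = dominance(center_vectors)
data_dominance = dominance(data_vectors)
print(center_dominance, data_dominance)
```

Compounds that share "center" are used in similar contexts, so their vectors are tightly grouped and the dominance is high; compounds that share "data" scatter more, giving a lower dominance.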

Using the constituent word dominances calculated by the constituent word dominance calculation unit 35, the compound word composition distribution estimation unit 36 calculates, for each compound word, a constituent word weighting coefficient among its constituent words, and creates a compound word composition distribution table that collects the constituent word weighting coefficients.
The compound word composition distribution table is a matrix in which each row corresponds to a compound word, each column corresponds to a constituent word of a compound word, and each value is the corresponding constituent word weighting coefficient.
An effective way to calculate the constituent word weighting coefficient is, for example, to normalize the constituent word dominance of each constituent word by dividing it by the sum of the constituent word dominances within the compound word.
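The normalization described above amounts to dividing each dominance by the total for the compound word. A minimal sketch follows; the function name, the equal-weight fallback for a zero total, and the sample values are assumptions.

```python
def weighting_coefficients(dominance_by_constituent):
    """Normalize constituent word dominances so that they sum to 1."""
    total = sum(dominance_by_constituent.values())
    if total == 0:
        # Assumption: fall back to equal weights when all dominances are 0.
        n = len(dominance_by_constituent)
        return {w: 1.0 / n for w in dominance_by_constituent}
    return {w: d / total for w, d in dominance_by_constituent.items()}

# Hypothetical dominances for the constituents of one compound word.
weights = weighting_coefficients({"data": 1.0, "center": 4.0})
print(weights)
```

One row of the compound word composition distribution table would then hold these coefficients for the compound word in question.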

In addition to the operation of the co-occurrence word concept estimation unit 40 described above, the co-occurrence word concept estimation unit 40A handles the basic word co-occurrence words that are compound words among the co-occurrence words of the basic word co-occurrence vectors in the basic word co-occurrence table created by the basic word co-occurrence vector extraction unit 30. Using coefficients based on the compound word composition distribution table created by the compound word composition distribution estimation unit 36, it estimates a co-occurrence word concept suited to each compound word in a way that fits the required estimation method. As one example, the co-occurrence word concept estimation unit 40A makes each constituent word of each compound word an independent basic word co-occurrence word. It then modifies the basic word co-occurrence vector by taking, as the co-occurrence count of each constituent word, the co-occurrence count of the compound basic word co-occurrence word multiplied by that constituent word's weighting coefficient from the compound word composition distribution table, and estimates the co-occurrence word concept of each basic word co-occurrence word of the basic word co-occurrence vector based on the predetermined concept estimation method.
When the concept estimation method takes peripheral words (including compound words) into account and uses, as the co-occurrence word concept, the basic word co-occurrence concept table that collects the peripheral word concept vectors for all basic word co-occurrence words of a specific basic word, the peripheral word composition vectors may be modified in the same way. For each peripheral word that is a compound word among the peripheral words of the peripheral word composition vectors, each constituent word is made an independent peripheral word, and its occurrence count is set to the occurrence count of the compound peripheral word multiplied by that constituent word's weighting coefficient from the compound word composition distribution table.
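The redistribution of a compound word's count over its constituent words can be sketched as follows. The function name and sample data are hypothetical, and adding a redistributed count to a constituent that already appears in the vector as a stand-alone word is an assumption the source does not address.

```python
from collections import defaultdict

def distribute_compound_counts(cooccurrence_vector, distribution_table):
    """Replace compound words in a vector with weighted constituent words.

    cooccurrence_vector: {word: co-occurrence count}
    distribution_table: {compound word: {constituent word: weight}}
    """
    result = defaultdict(float)
    for word, count in cooccurrence_vector.items():
        weights = distribution_table.get(word)
        if weights is None:
            # Not a compound word: keep the count unchanged.
            result[word] += count
        else:
            # Compound word: split its count by the weighting coefficients.
            for constituent, weight in weights.items():
                result[constituent] += count * weight
    return dict(result)

# Hypothetical vector and compound word composition distribution table.
vector = {"datacenter": 10, "connect": 3, "data": 1}
table = {"datacenter": {"data": 0.2, "center": 0.8}}
converted = distribute_compound_counts(vector, table)
print(converted)
```

The same routine would apply to a peripheral word composition vector, with occurrence counts in place of co-occurrence counts.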

The configurations and functions of the remaining components (the document input unit 10, the basic word co-occurrence vector extraction unit 30, the co-occurrence word classification unit 50, the polysemic word candidate estimation unit 60, the polysemic word candidate output unit 70, and the concept database 110) are the same as those in the first embodiment, so their description is omitted.

Next, the overall operation of the polysemic word extraction system 100A according to the second embodiment will be described with reference to FIG. 3 and the sequence shown in FIG. 4. The flowchart in FIG. 4 and the following description are one processing example; as in the first embodiment, the order of the steps may be changed and steps may be revisited.
Compared with the operation of the first embodiment described above, the operation of the second embodiment described below differs in that the following operations are added.

In addition to the operation of the word analysis unit 20 shown in FIG. 1 (step A2), the word analysis unit 20A queries the concept database 110 for the general concept information of each extracted word and extracts, as a compound word, any word that is not registered in the concept database 110 and is two or more characters long (step B1).
The word analysis unit 20A further queries the concept database 110 for general concept information on every substring of the compound word and extracts each substring that has registered general concept information as a significant constituent word of the compound word. If separating the extracted significant constituent words from the original compound word leaves a substring with no registered general concept information, that substring is extracted as an unknown constituent word (step B2).

Next, based on the word information of the words used in the sentences extracted by the word analysis unit 20A and on the compound words, the constituent word dominance calculation unit 35 treats the words that co-occur with a compound word under the compound word co-occurrence determination rule as compound word co-occurrence words, extracts the compound word co-occurrence words and their co-occurrence counts for each compound word, and collects them into a compound word co-occurrence table (step B3).
Further, based on the compound word co-occurrence table and the constituent words extracted by the word analysis unit 20A, the constituent word dominance calculation unit 35 extracts from the compound word co-occurrence table the compound word co-occurrence vectors of the partially matching compound words that contain the same constituent word, creates a partially matching compound word co-occurrence table for each constituent word, and calculates, as the constituent word dominance, the degree of aggregation among the partially matching compound words in the co-occurrence vector space obtained from the compound word co-occurrence vectors of that table (step B4).

Next, the compound word composition distribution estimation unit 36 uses the constituent word dominance values calculated by the constituent word dominance calculation unit 35 to calculate, for each compound word, a constituent word weighting coefficient for each of its constituent words, and compiles these coefficients into a compound word composition distribution table (step B5).

In addition to the operation of the co-occurrence word concept estimation unit 40 shown in FIG. 1 (step A5), the co-occurrence word concept estimation unit 40A processes those basic word co-occurrence words that are compound words in the basic word co-occurrence vectors of the basic word co-occurrence table created by the basic word co-occurrence vector extraction unit 30. Each constituent word of such a compound word is made an independent basic word co-occurrence word. Based on the compound word composition distribution table created by the compound word composition distribution estimation unit 36, the co-occurrence count of the compound word is multiplied by the constituent word weighting coefficient of each constituent word, and the resulting value is used as the co-occurrence count of that constituent word to update the basic word co-occurrence vector. The co-occurrence word concept of each basic word co-occurrence word in the basic word co-occurrence vector is then estimated by the predetermined concept estimation method (step A5').
The operation of the other steps is the same as in the first embodiment described above, so their description is omitted.
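The redistribution of co-occurrence counts in step A5' can be sketched as follows. The function name `expand_compounds`, the dictionary layout, and the weighting coefficients are hypothetical; the text does not state how counts are combined when two compounds share a constituent word, so this sketch simply assumes they are added.

```python
def expand_compounds(cooc_vector, distribution_table):
    """Replace each compound co-occurrence word by its weighted constituent words."""
    expanded = {}
    for word, count in cooc_vector.items():
        weights = distribution_table.get(word)
        if weights is None:
            # Not a compound word: keep the original co-occurrence count.
            expanded[word] = expanded.get(word, 0.0) + count
            continue
        for constituent, weight in weights.items():
            # Co-occurrence count of the compound times the constituent weight.
            expanded[constituent] = expanded.get(constituent, 0.0) + count * weight
    return expanded


# Counts taken from the "material" example; the weights are invented.
vector = {"manufacturing": 5, "change process": 4, "purchase process": 3}
table = {
    "change process": {"change": 0.75, "process": 0.25},
    "purchase process": {"purchase": 0.5, "process": 0.5},
}
result = expand_compounds(vector, table)
```

Under these assumed weights, "process" receives 4 × 0.25 + 3 × 0.5 = 2.5 and the total co-occurrence count of the vector is preserved because each compound's weights sum to 1.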

Next, the effects of the above operation of the second embodiment will be described.
In the second embodiment, in addition to the effects of the first embodiment, the constituent word dominance of each constituent word is calculated for the compound words among the basic word co-occurrence words, and each compound word is converted into concepts weighted according to the constituent word dominance. Polysemy candidates can therefore be extracted while also taking into account compound words for which no general concept information is registered in a thesaurus or the like. Such compound words are an obstacle to converting a basic word co-occurrence vector into a basic word concept vector; with this configuration, the similarity between basic word co-occurrence words can be evaluated even for a group of sentences containing many idiosyncratic compound words, and polysemic words that are assigned multiple meanings within the documents of a given project can be extracted more accurately.
Note that the polysemy extraction system 100A according to the second embodiment can be realized as a polysemy extraction method. The polysemy extraction system 100A according to the second embodiment of the present invention may also be implemented by having a computer execute a polysemy extraction program.

Next, the operation of the polysemy extraction system 100 according to the first embodiment will be described with reference to FIG. 5, using a specific first example.

The purpose of the first example is as follows.
First, the polysemy extraction system 100 estimates polysemy candidates A that hold only within the document group related to a specific project. These candidates are contained in documents D, such as proposals and specifications for information system construction, which include polysemic words that are also used with a meaning denoting a concept different from their general meaning. The polysemy extraction system 100 then outputs the estimation result, thereby supporting the creation of a glossary for unregistered terms and the definition of words. In the first example, the polysemy extraction system 100 consists of a document analysis system Y and an Internet server Z, as shown in FIG. 5.
The document analysis system Y runs on a PC terminal owned by the analyst B. Through the input unit and the output unit, it accepts the sentences that make up the document group from which the analyst B wants to extract polysemic words and presents the polysemy candidates A.
The Internet server Z is connected via a communication network to the PC terminal of the analyst B on which the document analysis system Y is installed. The Internet server Z is a device that, in response to queries from the document analysis system Y for concept information such as word meanings, makes it possible to search for general concept information Cg related to the concept classification of words, general polysemic words and synonyms, and usage.

The correspondence between FIG. 5 and FIG. 1 is as follows.
The document input unit 10, the word analysis unit 20, the basic word co-occurrence vector extraction unit 30, the co-occurrence word concept estimation unit 40, the co-occurrence word classification unit 50, and the polysemy candidate estimation unit 60 are included in the document analysis system Y. The polysemy candidate output unit 70 operates as the output unit of the PC terminal. The concept database 110 is included in the Internet server Z. The document analysis system Y and the Internet server Z equipped with these means operate as follows.

The document analysis system Y accepts, from the input unit, the documents D that make up the document group for which the analyst B wants to estimate polysemy candidates A that hold only within the document group related to a specific project. The document analysis system Y then applies morphological analysis and syntactic analysis to each sentence of the documents D to break it down into its constituent words, analyzes the part of speech of each word and the dependency relations between words, and extracts nouns, verbs, adjectives, and adjectival verbs as words W. For verbs belonging to the sa-row irregular conjugation, the conjugated part is removed and the resulting so-called sahen noun is extracted as the verb.

Furthermore, the document analysis system Y takes the nouns among the words W contained in the documents D as basic words S. For each basic word Si (i = 1, 2, ..., n), it extracts the verbs, adjectives, and adjectival verbs that are in a dependency relation with a specific basic word Sw (i = w) as basic word co-occurrence words Vwj (j = 1, 2, ..., m), tallies the number of times each basic word co-occurrence word Vwj co-occurs with the basic word Sw as the co-occurrence count Nwj, and creates the basic word co-occurrence vector Nw. For example, suppose that words such as “material” and “general affairs” are extracted from the documents D as basic words Sw, that words such as “manufacturing”, “stockpiling”, “mining”, “disposal”, “mixing”, “purchase process”, “estimate”, “order”, “budget”, and “change process” are extracted as co-occurrence words V, and that the co-occurrence counts Nwj of the basic word co-occurrence words Vwj for each basic word Sw are as shown in FIG. 6. Then the data set in each row of the table in FIG. 6 corresponds to a basic word co-occurrence vector Ni, the data set for a specific basic word Sw corresponds to the basic word co-occurrence vector Nw, and the basic word co-occurrence vector Nw of “material” is expressed as {5, 1, 1, 1, 3, 3, 4, 2, 1, 4, ...}.
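The tallying of basic word co-occurrence vectors described above can be sketched as follows. The sketch assumes that the dependency analysis has already produced (basic word, co-occurrence word) pairs; the function names and the pair list are hypothetical, and only the ten counts for “material” are taken from the text.

```python
from collections import Counter, defaultdict


def build_cooccurrence_table(dependency_pairs):
    """Count how often each co-occurrence word depends on each basic word."""
    table = defaultdict(Counter)
    for basic_word, cooc_word in dependency_pairs:
        table[basic_word][cooc_word] += 1
    return table


def as_vector(table, basic_word, columns):
    """Read one row of the table as a vector over a fixed column order."""
    return [table[basic_word][column] for column in columns]


columns = ["manufacturing", "stockpiling", "mining", "disposal", "mixing",
           "purchase process", "estimate", "order", "budget", "change process"]
counts = [5, 1, 1, 1, 3, 3, 4, 2, 1, 4]

# Synthetic dependency pairs that reproduce the counts quoted for "material".
pairs = [("material", c) for c, n in zip(columns, counts) for _ in range(n)]
pairs.append(("general affairs", "budget"))  # invented pair for a second row

table = build_cooccurrence_table(pairs)
material_vector = as_vector(table, "material", columns)
```

Each row of the resulting table plays the role of a vector Ni, and `material_vector` corresponds to Nw for the basic word “material”.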

The Internet server Z stores the general concept information Cg of a thesaurus in which words are classified and organized according to general relations between words, such as hypernym/hyponym relations, part/whole relations, synonym relations, and near-synonym relations. The Internet server Z also provides functions such as a search engine that extracts information on an arbitrary word. In response to a query from the document analysis system Y, it extracts the large, middle, and small classifications that form the general concept classification of the queried word and presents them as general concept information Cg.

The document analysis system Y extracts the co-occurrence word concept Cvwj of each basic word co-occurrence word Vwj in the basic word co-occurrence vector Nw, based on the general concept information Cg obtained by querying the Internet server Z.

A suitable method for extracting the co-occurrence word concepts Cvwj is to query the Internet server Z directly for the general concept information Cg of each basic word co-occurrence word Vwj. From the classification system of the thesaurus general concept information Cg stored in the Internet server Z, the large-classification co-occurrence word concept C1vwj, the middle-classification co-occurrence word concept C2vwj, and the small-classification co-occurrence word concept C3vwj are extracted as the co-occurrence word concept Cvwj to which each basic word co-occurrence word Vwj belongs, and a co-occurrence word concept diagram Cvwj is created that arranges them in a tree structure or the like so that the concept co-occurrence count Ncwj at each classification level can be seen. This method is called the direct concept extraction method. With the direct concept extraction method, if the co-occurrence word concepts C1vwj, C2vwj, and C3vwj shown in FIG. 7 are extracted for the basic word co-occurrence words Vwj of the basic word co-occurrence vector Nw for the basic word Sw “material” in FIG. 6, the co-occurrence word concept diagram Cvwj is represented by a tree diagram such as that in FIG. 8. In FIG. 8, the concept co-occurrence count Ncwj at each classification level of the co-occurrence word concept diagram Cvw is calculated as the sum of the co-occurrence counts Nwj of the basic word co-occurrence words Vwj that belong below that level. Compound words such as “change process” and “purchase process”, for which no general concept information Cg is registered in the Internet server Z, are processed by keeping the basic word co-occurrence word itself as a provisional concept.
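The roll-up of concept co-occurrence counts Ncwj through the classification tree can be sketched as follows. Because FIG. 7 and FIG. 8 are not reproduced in the text, the classification paths below are hypothetical stand-ins (only the small classification “production” for “manufacturing” and “mining” is stated in the description); the function name `concept_counts` is likewise an assumption.

```python
from collections import Counter


def concept_counts(cooc_counts, concept_paths):
    """Sum co-occurrence counts at every level of the classification tree."""
    totals = Counter()
    for word, count in cooc_counts.items():
        # Unregistered words (e.g. compounds) stay as their own provisional concept.
        path = concept_paths.get(word, (word,))
        for depth in range(1, len(path) + 1):
            totals[path[:depth]] += count
    return totals


cooc_counts = {"manufacturing": 5, "mining": 1, "estimate": 4, "change process": 4}

# Hypothetical (large, middle, small) classification paths.
concept_paths = {
    "manufacturing": ("human activity", "industry", "production"),
    "mining": ("human activity", "industry", "production"),
    "estimate": ("human activity", "economy", "trade"),
}

totals = concept_counts(cooc_counts, concept_paths)
```

Each key of `totals` is a path prefix, so the count at a node equals the sum of the counts of all co-occurrence words beneath it, which is what the tree diagram is described as showing.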

A more advanced method for extracting the co-occurrence word concepts Cvwj, the indirect concept extraction method, is described below. In the indirect concept extraction method, for each basic word co-occurrence word Vwj, the verbs, adjectives, and adjectival verbs that are in a dependency relation with that basic word co-occurrence word Vwj, together with the nouns that co-occur with it in sentences within the same table-of-contents item, are extracted as peripheral words Vvwjf (f = 1, 2, ..., y). The number of times each peripheral word Vvwjf co-occurs with the basic word co-occurrence word Vwj is tallied as the existence count Ljf, and a peripheral word composition table VV is created that summarizes, in tabular form, the peripheral words Vvwjf for all basic word co-occurrence words Vwj.
The data set that collects the existence counts Ljf of the peripheral words Vvwjf for a basic word co-occurrence word Vwj in the peripheral word composition table VV is called the peripheral word composition vector Lj. The general concept information Cg of each peripheral word Vvwjf in the peripheral word composition table VV is obtained from the classification system of the thesaurus general concept information Cg stored in the Internet server Z by querying the Internet server Z. The large-classification peripheral word concept C1vwjf, the middle-classification peripheral word concept C2vwjf, and the small-classification peripheral word concept C3vwjf to which each peripheral word Vvwjf belongs are then extracted, and three co-occurrence word concept tables are created. In the large-classification co-occurrence word concept table VC1, the peripheral words Vvwjf of the peripheral word composition table VV are converted into peripheral word concepts C1vwjf, peripheral words Vvwjf that share the same concept are merged, and the sum of their existence counts Ljf is entered in the corresponding cell. The middle-classification co-occurrence word concept table VC2 and the small-classification co-occurrence word concept table VC3 are created in the same way using the peripheral word concepts C2vwjf and C3vwjf, respectively.
The data set that collects the existence counts Lc1jf of the peripheral word concepts C1vwjf for a basic word co-occurrence word Vwj in the large-classification co-occurrence word concept table VC1 is called the large-classification co-occurrence word concept vector Lc1j. The data set that collects the existence counts Lc2jf of the peripheral word concepts C2vwjf for a basic word co-occurrence word Vwj in the middle-classification co-occurrence word concept table VC2 is called the middle-classification basic word concept vector Lc2j. The data set that collects the existence counts Lc3jf of the peripheral word concepts C3vwjf for a basic word co-occurrence word Vwj in the small-classification co-occurrence word concept table VC3 is called the small-classification co-occurrence word concept vector Lc3j.

Here, the large-classification co-occurrence word concept vector Lc1j, the middle-classification basic word concept vector Lc2j, and the small-classification co-occurrence word concept vector Lc3j correspond to the co-occurrence word concept Cvwj. For example, suppose that, as in FIG. 6, words such as “manufacturing” and “change process” are extracted from the documents D as basic word co-occurrence words Vwj, and that words such as “use”, “operation”, “construction”, “improvement”, “system change”, “mechanism”, “instantaneous”, “short-term”, “running”, and “high-speed processing” are extracted as the peripheral words Vvwjf of these basic word co-occurrence words Vwj. The peripheral word composition table VV is then a table such as that in FIG. 9, with the basic word co-occurrence words Vwj in the rows, the peripheral words Vvwjf in the columns, and the existence counts Ljf in the cells. The data set in the row of a basic word co-occurrence word Vwj in FIG. 9 corresponds to the peripheral word composition vector Lj, and the peripheral word composition vector Lj of “manufacturing” is expressed as {0, 3, 2, 0, 4, 0, 1, 0, 3, 0, ...}. Because both the basic word co-occurrence words Vwj and the peripheral words Vvwjf include nouns, a word previously selected as a basic word co-occurrence word Vwj may be treated as a peripheral word Vvwjf when another word is the basic word co-occurrence word Vwj.

Furthermore, if the peripheral word concepts C1vwjf, C2vwjf, and C3vwjf shown in FIG. 10 are extracted for the peripheral words Vvwjf in the peripheral word composition table VV of FIG. 9, the large-classification co-occurrence word concept table VC1 is as shown in FIG. 11, the middle-classification co-occurrence word concept table VC2 as shown in FIG. 12, and the small-classification co-occurrence word concept table VC3 as shown in FIG. 13. Each is a table with the basic word co-occurrence words Vwj in the rows and the peripheral word concepts Cvwjf of that classification in the columns. The counts in the co-occurrence word concept tables VC1, VC2, and VC3 are obtained as follows, taking the large-classification table VC1 as an example. Among the peripheral words Vvwjf, “use”, “operation”, “construction”, “improvement”, and “running” share the peripheral word concept C1vwjf “human activity”, so the existence count Lc1jf is “8”, the sum of the existence counts of these peripheral words for the same basic word co-occurrence word “manufacturing”. Similarly, “mechanism”, “instantaneous”, and “short-term” share the peripheral word concept C1vwjf “abstract”, so the existence count Lc1jf is “1”, the sum of the existence counts of these peripheral words for the basic word co-occurrence word “manufacturing”. Compound words such as “system change” and “high-speed processing”, for which no general concept information Cg is registered in the Internet server Z, are processed by keeping the co-occurrence word itself as a provisional concept. From FIG. 11, the large-classification co-occurrence word concept vector Lc1j of the basic word co-occurrence word “manufacturing” is expressed as {8, 4, 1, 0, ...}.
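The conversion from a peripheral word composition vector to a large-classification concept vector can be checked with a short sketch. The function name `to_concept_vector` is hypothetical, and the pairing of counts with words assumes that the vector {0, 3, 2, 0, 4, 0, 1, 0, 3, 0} follows the order in which the peripheral words are listed; the two different source words that both translate as "operation" are distinguished here as "operation" and "running".

```python
def to_concept_vector(peripheral_counts, concept_of):
    """Merge peripheral words that share a concept and sum their existence counts."""
    vector = {}
    for word, count in peripheral_counts.items():
        # Words with no registered concept are kept as provisional concepts.
        concept = concept_of.get(word, word)
        vector[concept] = vector.get(concept, 0) + count
    return vector


# Peripheral word composition vector Lj of "manufacturing" (assumed word order).
manufacturing = {
    "use": 0, "operation": 3, "construction": 2, "improvement": 0,
    "system change": 4, "mechanism": 0, "instantaneous": 1, "short-term": 0,
    "running": 3, "high-speed processing": 0,
}

# Large-classification concepts as stated in the description.
large_concept = {
    "use": "human activity", "operation": "human activity",
    "construction": "human activity", "improvement": "human activity",
    "running": "human activity",
    "mechanism": "abstract", "instantaneous": "abstract", "short-term": "abstract",
}

lc1 = to_concept_vector(manufacturing, large_concept)
```

Under that ordering assumption the result agrees with the values quoted for FIG. 11: 8 for “human activity”, 4 for the provisional concept “system change”, 1 for “abstract”, and 0 for “high-speed processing”.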

Furthermore, the document analysis system Y calculates the similarity Fw between the basic word co-occurrence words Vwj based on their co-occurrence word concepts Cvwj, groups together the basic word co-occurrence words Vwj whose similarity Fw is greater than an arbitrary threshold, and, by clustering the basic word co-occurrence words Vwj in this way, extracts the basic word co-occurrence word clusters Ewz.

As an example of how to calculate the similarity Fwpq between the basic word co-occurrence word Vwp (j = p) and the basic word co-occurrence word Vwq (j = q), when the co-occurrence word concepts Cvwj are obtained by the direct concept extraction method, the similarity is quantified as the number of levels between the classification level at which the co-occurrence word concepts Cvwp and Cvwq fall into the same classification and the coarsest classification level of the classification system. For example, in a thesaurus whose classification system has three levels, large classification (level 1), middle classification (level 2), and small classification (level 3), as in the example of FIG. 8, the basic word co-occurrence word Vwp “manufacturing” and the basic word co-occurrence word Vwq “mining” match at the small-classification co-occurrence word concept C3vwj “production”, so the difference between level 1 and level 3, “2”, is the similarity index. If the threshold for the similarity Fw is set to 1 or more in the example of FIG. 8, the basic word co-occurrence words Vwj are clustered at or below the middle-classification co-occurrence word concept C2vwj, and the five clusters enclosed by dotted lines in FIG. 14 are extracted as the basic word co-occurrence word clusters Ewz.
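The level-difference similarity and the threshold clustering for the direct concept extraction method can be sketched as follows. The classification paths are hypothetical (FIG. 8 is not reproduced in the text), the function names are assumptions, and returning -1 for words that share no classification at all is a choice made here, not something the description specifies.

```python
def direct_similarity(path_p, path_q):
    """Deepest shared classification level minus the coarsest level (level 1)."""
    shared = 0
    for a, b in zip(path_p, path_q):
        if a != b:
            break
        shared += 1
    return shared - 1  # -1 means not even the large classification is shared


def clusters_at(paths, min_similarity):
    """Group words whose pairwise similarity is at least min_similarity."""
    depth = min_similarity + 1  # number of leading levels that must match
    groups = {}
    for word, path in paths.items():
        # Words classified less deeply than required form their own cluster.
        key = path[:depth] if len(path) >= depth else (word,)
        groups.setdefault(key, []).append(word)
    return list(groups.values())


# Hypothetical (large, middle, small) paths; compounds keep a provisional concept.
paths = {
    "manufacturing": ("human activity", "industry", "production"),
    "mining": ("human activity", "industry", "production"),
    "estimate": ("human activity", "economy", "trade"),
    "change process": ("change process",),
}

clusters = clusters_at(paths, min_similarity=1)
```

Because the similarity is defined on a tree, "similarity of at least t" is equivalent to "the first t + 1 levels match", which is why grouping by path prefix suffices here.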

On the other hand, when the co-occurrence word concepts Cvwj are obtained by the indirect concept extraction method, the following are calculated: the cosine distance dc1pq between the large-classification co-occurrence word concept vector Lc1p corresponding to the basic word co-occurrence word Vwp and the large-classification co-occurrence word concept vector Lc1q corresponding to the basic word co-occurrence word Vwq, the cosine distance dc2pq between the middle-classification co-occurrence word concept vectors Lc2p and Lc2q, and the cosine distance dc3pq between the small-classification co-occurrence word concept vectors Lc3p and Lc3q. The sum of these distances, each multiplied by its classification weighting coefficient β1, β2, β3 (β1 < β2 < β3), is calculated by equation (1) below as the distance dpq between basic word co-occurrence words, and the similarity Fwpq is calculated by a function that decreases monotonically with the distance dpq, such as its reciprocal. This processing is performed for all combinations of basic word co-occurrence words Vij.
dpq = β1 × dc1pq + β2 × dc2pq + β3 × dc3pq ... (1)
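Equation (1) and the reciprocal similarity can be sketched as follows. The description does not define "cosine distance"; this sketch assumes the common definition 1 minus cosine similarity, and the handling of zero vectors and of dpq = 0 is likewise an assumption. The function names are hypothetical; the default weights and the three distances in the example are the ones quoted in the description for FIGS. 11 to 13.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity (assumed definition of cosine distance)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0:
        return 1.0  # assumption: a zero vector is maximally distant
    return 1.0 - dot / norm


def word_distance(dc1, dc2, dc3, betas=(0.009, 0.09, 0.9)):
    """Equation (1): dpq = b1*dc1pq + b2*dc2pq + b3*dc3pq with b1 < b2 < b3."""
    b1, b2, b3 = betas
    return b1 * dc1 + b2 * dc2 + b3 * dc3


def similarity(dpq):
    """Reciprocal of the distance, one monotonically decreasing choice."""
    return math.inf if dpq == 0 else 1.0 / dpq


# Distances quoted in the description for its worked example.
dpq = word_distance(0.26, 0.57, 0.68)
```

With the quoted values, dpq = 0.009 × 0.26 + 0.09 × 0.57 + 0.9 × 0.68 = 0.66564, which rounds to the 0.67 given in the description. Because β3 is the largest weight, agreement at the small-classification level dominates the distance.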

For example, in the example of FIGS. 11 to 13, the cosine distances between the words “manufacturing” and “accumulation” are dc1pq = 0.26, dc2pq = 0.57, and dc3pq = 0.68; with classification weighting coefficients β1 = 0.009, β2 = 0.09, and β3 = 0.9, the distance between the basic word co-occurrence words is dpq = 0.67. As the clustering method, each basic word co-occurrence word Vwj is first regarded as an initial cluster. Using the distance dpq between basic word co-occurrence words, the two clusters with the smallest inter-cluster distance are merged into a new cluster, the distances between all of the resulting clusters are recalculated, and the two closest clusters are merged again. This is repeated until all clusters have been merged into a single cluster, which yields a dendrogram. The groups of basic word co-occurrence words Vwj formed at an arbitrary inter-cluster distance criterion are taken as the basic word co-occurrence word clusters Ewz. When the inter-cluster distance criterion is set to 5 in the dendrogram obtained from the information in FIGS. 9 to 13, two clusters are extracted as the basic word co-occurrence word clusters Ewz, as shown in FIG. 15.
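The bottom-up clustering described above can be sketched in plain Python. The description does not say how the distance between two multi-word clusters is measured, so this sketch assumes single linkage (the minimum pairwise distance). Instead of building the full dendrogram and cutting it, it stops merging once the closest pair exceeds the criterion, which gives the same flat clusters for single linkage. The function name and the distances in the example are hypothetical stand-ins for dpq.

```python
def agglomerate(items, distance, cut):
    """Merge the closest clusters until the closest pair is farther than `cut`."""
    clusters = [[item] for item in items]  # every word starts as its own cluster

    def linkage(a, b):
        # Single linkage (assumption): distance of the closest pair of members.
        return min(distance(x, y) for x in a for y in b)

    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > cut:
            break  # equivalent to cutting the dendrogram at height `cut`
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters


# Invented one-dimensional positions so that |difference| acts as a distance.
position = {"manufacturing": 0.0, "mining": 0.1, "mixing": 0.2,
            "estimate": 1.0, "order": 1.1}


def dist(a, b):
    return abs(position[a] - position[b])


result = agglomerate(list(position), dist, cut=0.3)
```

With these invented distances the words separate into two clusters, mirroring the two-cluster outcome the description reports for FIG. 15; a different linkage rule could change the result on real data.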

For the basic word co-occurrence word clusters Ewz obtained by clustering the basic word co-occurrence words Vwj of a specific basic word Sw, the document analysis system Y extracts, as the cluster size Nwz, the sum of the co-occurrence counts Nwj of the basic word co-occurrence words Vwj belonging to each basic word co-occurrence word cluster Ewz. A basic word Sw that has two or more clusters whose cluster size Nwz is at or above an arbitrarily set threshold is then extracted as a polysemy candidate Aw, that is, a word that shows several semantically distinct usages and may therefore be polysemic.
In the example of FIGS. 6 to 9, in which the co-occurrence word concepts Cvwj were obtained by the direct concept extraction method, if the threshold is set to 20%, then because there are 25 basic word co-occurrence words Vwj, the two clusters “industry” and “economy” are extracted as basic word co-occurrence word clusters Ewz to which five or more basic word co-occurrence words Vwj belong, and the basic word Sw “material” is determined to be a polysemy candidate Aw. Judging from the meanings of the basic word co-occurrence words Vwj belonging to the cluster “industry” and of those belonging to the cluster “economy”, “material” is likely to carry two meanings, namely “raw material” and an abbreviation of “materials procurement department”, and polysemy of this kind can be found. Similarly, in the example of FIG. 15, in which the co-occurrence word concepts Cvwj were obtained by the indirect concept extraction method, if the threshold is set to 20%, five or more basic word co-occurrence words Vwj belong to each of the two clusters, so the basic word Sw “material” is determined to be a polysemy candidate Aw.
Furthermore, for each polysemy candidate Aw, the document analysis system Y marks the occurrences of that polysemy candidate Aw in the requirements document D, for example by color coding or bold emphasis, and outputs the processed requirements document D from the output unit.
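The cluster-size test for polysemy candidates can be sketched as follows. The 20% threshold is read here as a fraction of the total co-occurrence count (25 in the “material” example, giving a cut-off of 5). The co-occurrence counts are those quoted for “material”, but the assignment of words to clusters is hypothetical because FIG. 14 is not reproduced in the text; the cluster name "storage" and the function name are also invented.

```python
def find_large_clusters(clusters, cooc_counts, ratio=0.2):
    """Return (is_candidate, names of clusters whose size reaches the threshold)."""
    total = sum(cooc_counts.values())
    large = []
    for name, words in clusters.items():
        size = sum(cooc_counts[w] for w in words)  # cluster size Nwz
        if size >= ratio * total:
            large.append(name)
    # Two or more sufficiently large clusters suggest multiple usages.
    return len(large) >= 2, large


cooc_counts = {
    "manufacturing": 5, "stockpiling": 1, "mining": 1, "disposal": 1, "mixing": 3,
    "purchase process": 3, "estimate": 4, "order": 2, "budget": 1, "change process": 4,
}

# Hypothetical cluster membership; only the names "industry" and "economy"
# come from the description.
clusters = {
    "industry": ["manufacturing", "mining", "mixing"],
    "economy": ["estimate", "order", "budget"],
    "storage": ["stockpiling", "disposal"],
    "purchase process": ["purchase process"],
    "change process": ["change process"],
}

is_candidate, large = find_large_clusters(clusters, cooc_counts)
```

Under this assumed membership, only “industry” (size 9) and “economy” (size 7) reach the cut-off of 5, so “material” would be flagged as a polysemy candidate, consistent with the outcome the description reports.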

Next, the operation of the polysemy extraction system 100A according to the second embodiment will be described with reference to FIG. 10, using a specific second example.
In the second example, the polysemy extraction system 100A uses an Internet server Z', as shown in FIG. 16.
The document analysis system Ya runs on a PC terminal owned by the analyst B. Through the input unit and the output unit, it accepts the sentences that make up the document group from which the analyst B wants to extract polysemic words and presents the polysemy candidates A.

The Internet server Z′ is a server that provides an existing thesaurus and is connected via a communication network to the PC terminal of the analyst B on which the document analysis system Ya is installed. In response to queries for word concept information from the document analysis system Ya, the Internet server Z′ makes it possible to retrieve general concept information Cg relating to the concept classification of a word, its general synonyms and near-synonyms, and its usage.

In the second example, in addition to the operation of the first example, the document analysis system Ya further includes a constituent word dominance calculation unit 35 and a compound word composition distribution estimation unit 36.
That is, the correspondence between FIG. 16 and FIG. 3 is as follows.
The document input unit 10, the word analysis unit 20A, the constituent word dominance calculation unit 35, the compound word composition distribution estimation unit 36, the basic word co-occurrence vector extraction unit 30, the co-occurrence word concept estimation unit 40A, the co-occurrence word classification unit 50, and the polysemic word candidate estimation unit 60 are included in the document analysis system Ya. The polysemic word candidate output unit 70 operates as the output unit of the PC terminal. The concept database 110 is included in the Internet server Z′.

The document analysis system Ya with this configuration adds the following operations to the first example described above.
The document analysis system Ya queries the Internet server Z′ for the general concept information Cg of each basic word co-occurrence word Vij, thereby checking whether general concept information Cg for each word Vij is registered in the thesaurus stored in the Internet server Z′. Words that have no general concept information Cg registered in the thesaurus and that are two or more characters long are extracted as compound words Vme (e = 1, 2, ..., h). For example, if the word “購買処理” (“purchase processing”) is not registered in the thesaurus, it is extracted as a compound word because it has two or more characters.
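As a minimal sketch of the compound word extraction rule just described, the thesaurus lookup can be modeled as membership in a set. The set `thesaurus` below is a hypothetical stand-in for the query to the Internet server Z′, and the function name is illustrative only.

```python
def extract_compound_words(cooccurrence_words, thesaurus):
    """Return words that are not registered in the thesaurus
    and are two or more characters long."""
    return [w for w in cooccurrence_words
            if len(w) >= 2 and w not in thesaurus]

# Hypothetical thesaurus contents; the real lookup would query the server.
thesaurus = {"購買", "処理", "変更", "資材"}
compounds = extract_compound_words(["購買処理", "資材", "変更処理", "処理"], thesaurus)
print(compounds)
```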

Further, for each compound word Vme, the document analysis system Ya splits the character string of the compound word Vme in every possible pattern and checks, for every resulting partial character string, whether general concept information Cg is registered in the thesaurus stored in the Internet server Z′. The partial character strings of the pattern that minimizes the number of characters in partial character strings without registered general concept information are treated as the constituent words Pek (k = 1, 2, ..., l) of the compound word Vme. Among the constituent words Pek, partial character strings with registered general concept information Cg are extracted as significant constituent words Paek, and those without are extracted as unknown constituent words Pbek, for each compound word.
In the example of the compound word “購買処理” (“purchase processing”) in FIG. 6, the possible splits are {“購”, “買処理”}, {“購買”, “処理”}, and {“購買処”, “理”}. If “買処理” and “購買処” are not registered in the thesaurus, then “購”, “購買”, “処理”, and “理” are candidates for significant constituent words Paek, and “買処理” and “購買処” are candidates for unknown constituent words Pbek. The combination {“購買” (“purchase”), “処理” (“processing”)}, which has the fewest characters in partial character strings without registered general concept information Cg, is selected as the significant constituent words of the compound word “購買処理”.
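The split selection for “購買処理” can be sketched as below. This is an illustration under assumptions, not the patented implementation: the thesaurus is again a hypothetical set, splits into three or more parts are enumerated as well (the text says "every pattern"), and ties are broken in favor of fewer parts, which the source does not specify.

```python
from itertools import combinations

def segmentations(word):
    """Yield every way to cut `word` into two or more contiguous parts."""
    n = len(word)
    for k in range(1, n):
        for cuts in combinations(range(1, n), k):
            bounds = (0, *cuts, n)
            yield [word[a:b] for a, b in zip(bounds, bounds[1:])]

def split_compound(word, thesaurus):
    """Pick the split with the fewest characters in unregistered parts.

    Returns (significant constituent words, unknown constituent words).
    """
    def unknown_chars(parts):
        return sum(len(p) for p in parts if p not in thesaurus)

    # Tie-break on the number of parts: an assumption, not stated in the source.
    best = min(segmentations(word),
               key=lambda parts: (unknown_chars(parts), len(parts)))
    significant = [p for p in best if p in thesaurus]
    unknown = [p for p in best if p not in thesaurus]
    return significant, unknown

thesaurus = {"購", "購買", "処理", "理"}   # hypothetical registrations
result = split_compound("購買処理", thesaurus)
print(result)
```

With these registrations the split {“購買”, “処理”} leaves no unregistered characters, so it is chosen over every other pattern.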

Within the sentences of a paragraph that the analyst B has designated as a group of sentences dealing with a single range of content in the document D, such as “functions of the information system to be constructed”, the document analysis system Ya takes the nouns that co-occur with a compound word Vme, and the verbs, adjectives, and adjectival verbs in a dependency relationship with the compound word Vme, as s compound word co-occurrence words Umer (r = 1, 2, ..., s). For each compound word Vme it extracts the compound word co-occurrence words Umer and their co-occurrence counts Mer within the range regarded as co-occurrence, and creates a compound word co-occurrence table VUm, a sparse matrix in which each row corresponds to a compound word Vme, each column corresponds to a compound word co-occurrence word Umer, and each entry is the co-occurrence count Mer of the compound word co-occurrence word Umer with the compound word Vme.
Further, for each constituent word Pek, the document analysis system Ya extracts from the compound word co-occurrence table VUm the row components (Mx1, Mx2, Mx3, ..., Mxs) of the t compound words Vmx that contain the same constituent word Px (x = 1, 2, ..., t), and creates a partial-match compound word co-occurrence table VUx, a sparse matrix in which each row corresponds to a compound word Vmx, each column corresponds to a compound word co-occurrence word Umxr, and each entry is the co-occurrence count Mxr of the compound word co-occurrence word Umxr with the compound word Vmx.
For example, a table such as FIG. 17 is created as the partial-match compound word co-occurrence table for the constituent word “processing”, and a table such as FIG. 18 for the constituent word “change”. Further, as in Equation 1 below, the document analysis system Ya calculates the variance σxr of the data column (M1r, M2r, M3r, ..., Mtr) for each compound word co-occurrence word Umxr of the partial-match compound word co-occurrence table VUx, and calculates the reciprocal of the square root of the mean of the variances σxr over all compound word co-occurrence words Umxr as the constituent word dominance Gx of the constituent word Px.

[Equation 1]
$$G_x = \frac{1}{\sqrt{\dfrac{1}{s}\sum_{r=1}^{s}\sigma_{xr}}}$$
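The constituent word dominance Gx described in the text can be sketched in a few lines. Two points are assumptions rather than statements from the source: the variance is taken as the population variance (dividing by t), and a table whose columns all have zero variance is mapped to infinity to avoid division by zero. The sample rows are hypothetical and are not the values of FIG. 17 or FIG. 18.

```python
from math import sqrt, inf

def variance(values):
    """Population variance of a sequence of numbers (assumed convention)."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

def constituent_dominance(rows):
    """rows: one list of co-occurrence counts per compound word that
    contains the constituent word; columns are co-occurrence words.

    Returns 1 / sqrt(mean of the per-column variances)."""
    columns = list(zip(*rows))
    mean_var = sum(variance(col) for col in columns) / len(columns)
    return inf if mean_var == 0 else 1.0 / sqrt(mean_var)

# Hypothetical partial-match table: 2 compound words x 3 co-occurrence words.
rows = [[4, 0, 2],
        [2, 0, 0]]
g = constituent_dominance(rows)
print(g)
```

A constituent word whose compound words co-occur with similar words in similar numbers has small column variances and therefore a large Gx.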

For each compound word Vme, the document analysis system Ya calculates a constituent word weighting coefficient αek by normalizing the constituent word dominance Gxek of each constituent word Pek, that is, by dividing it by the sum of the constituent word dominances Gxek of that compound word. It then creates a compound word composition distribution table Te, a sparse matrix in which each row corresponds to a compound word Vme, each column corresponds to a constituent word Pek, and each entry is the constituent word weighting coefficient αek of the constituent word Pek for the compound word Vme.
For example, for “change processing” and “purchase processing”, which are the compound words among the basic word co-occurrence words in FIG. 6, if the constituent word dominance Gx of the constituent word “processing” is 1.47, that of “change” is 2.21, and that of “purchase” is 3.43, the compound word composition distribution table Te is as shown in FIG. 19. FIG. 19 shows that, when the compound word “change processing” is understood as the combination of the constituent words “change” and “processing”, the constituent word “change” is more important than the constituent word “processing”.
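The normalization can be checked with the dominance values quoted in the text (1.47, 2.21, 3.43). The sketch below is illustrative; the function name and the dictionary layout are assumptions. It divides each constituent word's dominance by the sum over the constituent words of the same compound word.

```python
def constituent_weights(dominance, constituents):
    """Normalize the dominance of each constituent word of one compound word."""
    total = sum(dominance[c] for c in constituents)
    return {c: dominance[c] / total for c in constituents}

# Dominance values taken from the example in the text.
G = {"processing": 1.47, "change": 2.21, "purchase": 3.43}

table = {
    "change processing": constituent_weights(G, ["change", "processing"]),
    "purchase processing": constituent_weights(G, ["purchase", "processing"]),
}
for compound, weights in table.items():
    print(compound, {c: round(a, 1) for c, a in weights.items()})
```

Rounded to one decimal place, the weights come out as 0.6 and 0.4 for “change processing” and 0.7 and 0.3 for “purchase processing”, matching the coefficients quoted for FIG. 19.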

Because a compound word Vme is itself one of the basic word co-occurrence words Vij, the document analysis system Ya makes each constituent word Pek of a compound word Vmwe that co-occurs with a specific basic word Sw into an independent basic word co-occurrence word Vmwek. Then, based on the compound word composition distribution table Te, it calculates the co-occurrence count Nwek as the co-occurrence count Nwe of the compound word Vmwe multiplied by the constituent word weighting coefficient αek of each constituent word Pek, and thereby modifies the basic word co-occurrence vector Nw. Taking the basic word “material” in FIG. 6 in detail: the constituent words “change” and “processing” of the compound word “change processing”, and the constituent words “purchase” and “processing” of the compound word “purchase processing”, become independent basic word co-occurrence words. As shown in FIG. 19, the constituent word weighting coefficients of “change processing” are change = 0.6 and processing = 0.4, and those of “purchase processing” are purchase = 0.7 and processing = 0.3, so the weighted co-occurrence counts Nwek are change: 4 × 0.6 = 2.4, processing: 4 × 0.4 + 3 × 0.3 = 2.5, and purchase: 3 × 0.7 = 2.1. The other basic word co-occurrence words Vw are processed in the same way, and the basic word co-occurrence vector shown in FIG. 6 is converted into the basic word co-occurrence vector shown in FIG. 20.
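The vector update can be sketched with the counts and coefficients quoted in the text (counts 4 and 3; weights 0.6/0.4 and 0.7/0.3). The dictionary representation and the function name `expand_compounds` are assumptions made for illustration; entries for non-compound words are passed through unchanged.

```python
def expand_compounds(vector, weights_table):
    """Replace each compound word in a co-occurrence vector by its
    constituent words, distributing the count by the weighting coefficients."""
    updated = {}
    for word, count in vector.items():
        parts = weights_table.get(word)
        if parts is None:                      # not a compound word
            updated[word] = updated.get(word, 0.0) + count
        else:
            for part, alpha in parts.items():
                updated[part] = updated.get(part, 0.0) + count * alpha
    return updated

vector = {"change processing": 4, "purchase processing": 3}
weights_table = {
    "change processing": {"change": 0.6, "processing": 0.4},
    "purchase processing": {"purchase": 0.7, "processing": 0.3},
}
new_vector = expand_compounds(vector, weights_table)
print({w: round(n, 2) for w, n in new_vector.items()})
```

The constituent word “processing” receives contributions from both compound words, which is why its weighted count (about 2.5) exceeds either single contribution.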

The other operations of the document analysis system Ya are the same as in the first example. For example, suppose that, for each basic word co-occurrence word Vwj (with co-occurrence count Nwj) of the basic word co-occurrence vector Nw for the basic word Sw “material” in FIG. 20, the concept direct extraction method yields the co-occurrence word concepts C1vwj, C2vwj, and C3vwj shown in FIG. 21. The co-occurrence word concepts Cvwj are then represented by a tree diagram such as FIG. 22. If the threshold of the similarity Fw is set to 1 or more in the example of FIG. 22, the basic word co-occurrence words Vwj are clustered at or below the middle-category co-occurrence word concept C2vwj, and the three clusters enclosed by dotted lines in FIG. 23 are extracted as basic word co-occurrence word clusters Ewz. Even if the cluster size threshold is raised to 30%, higher than in the first example, the two clusters “Industry” and “Economy” are extracted as the basic word co-occurrence word clusters Ewz to which the basic word co-occurrence words Vwj belong, and the basic word Sw “material” is determined to be a polysemic word candidate Aw. By considering unknown compound words that are not registered in the dictionary constituent word by constituent word in this way, polysemic words can be estimated accurately while taking more basic word co-occurrence words into account.
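One way to picture clustering at the middle-category level is to group co-occurrence words by the first two levels of their concept path (large category, middle category). The sketch below rests on several assumptions: the concept paths are invented and do not reproduce FIGS. 21 to 23, grouping on a shared two-level prefix is only one possible reading of the similarity threshold, and cluster size is measured by the number of member words.

```python
from collections import defaultdict

def cluster_by_concept(concept_paths, depth=2):
    """Group words whose concept paths agree on the first `depth` levels."""
    clusters = defaultdict(list)
    for word, path in concept_paths.items():
        clusters[tuple(path[:depth])].append(word)
    return dict(clusters)

# Hypothetical (large, middle, small) concept paths for illustration only.
paths = {
    "factory":    ("society", "industry", "manufacturing"),
    "processing": ("society", "industry", "manufacturing"),
    "delivery":   ("society", "industry", "logistics"),
    "purchase":   ("society", "economy", "trade"),
    "order":      ("society", "economy", "trade"),
    "budget":     ("society", "economy", "finance"),
    "change":     ("abstract", "action", "alteration"),
}

clusters = cluster_by_concept(paths)
min_size = 0.30 * len(paths)                     # 30% cluster size threshold
large = sorted(key for key, ws in clusters.items() if len(ws) >= min_size)
print(len(clusters), large)
```

With this sample, three clusters are formed and two of them pass the 30% size threshold, so the basic word would be flagged as a candidate.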

As described above, according to the polysemic word extraction system of the present invention, for documents relating to a given project, such as proposals and specifications for information system construction, in which words are assigned multiple meanings, the polysemic words in effect in those documents can be identified from the document or document group used for the analysis. This helps reduce confusion and failures caused by misunderstanding when an information system is constructed. The reason is that the similarity between the co-occurrence words of a word is calculated from how well they match at the concept level, and the co-occurrence words are clustered accordingly. As a result, even from the limited amount of text in a document group on a specific project, and even when the same co-occurrence word is never reused, words that have multiple groups of co-occurrence words in their usage, and are therefore likely to be polysemic, can be extracted.

The specific configuration of the present invention is not limited to the embodiments described above, and modifications that do not depart from the gist of the invention are included in the invention.

For example, in order to extract polysemic words from a document containing words that are used with a concept different from the general concept, an information processing apparatus operating as the polysemic word extraction system may proceed as follows when extracting polysemic words from a document received through the input unit. It extracts each word used in the text; selects an arbitrary word from the extracted words as a basic word and extracts the basic word co-occurrence vector of that basic word from the basic word co-occurrence words that have a co-occurrence relationship with it and their co-occurrence counts; individually estimates the co-occurrence word concept of each basic word co-occurrence word included in the basic word co-occurrence vector; clusters the basic word co-occurrence words based on the similarity between the estimated co-occurrence word concepts; and, when a plurality of clusters exist for the selected basic word, makes that basic word a polysemic word candidate. It repeats this processing and outputs the extracted polysemic word candidates from the output unit.

In doing so, a weight may be assigned to each document (each group of sentences) to be analyzed.
For example, documents of high reliability and documents of lower reliability may be input together with their weights, and the weights may be used as coefficients.
Weights may also be assigned according to the author or affiliated organization of the document group (sentence group) to be analyzed.
Weights may also be calculated from the citation relationships and citation counts of the document group.
When polysemic words are extracted from translated text, the general concepts of the source language before translation may be used as the general concepts.
This information may be received from the operator, or may be extracted automatically by applying natural language analysis, such as syntactic or semantic analysis, to the text.
Natural language analysis may also be used to extract suitable candidates for, or automatically select, the algorithm to be used, such as the concept estimation method.
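One possible reading of per-document weighting is to scale each document's co-occurrence counts by its weight before summing them into a single vector. The source does not say how the weights are applied, so the sketch below is an assumption throughout: the function name, the default weight of 1.0, and the sample documents are all hypothetical.

```python
from collections import Counter

def weighted_cooccurrence(doc_counts, doc_weights):
    """Sum co-occurrence counts over documents, scaling each by its weight.

    doc_counts:  {document id: {co-occurrence word: count}}
    doc_weights: {document id: weight}; missing documents default to 1.0.
    """
    total = Counter()
    for doc, counts in doc_counts.items():
        w = doc_weights.get(doc, 1.0)
        for word, n in counts.items():
            total[word] += w * n
    return dict(total)

doc_counts = {
    "specification": {"order": 2},
    "meeting memo":  {"order": 2, "cost": 4},
}
doc_weights = {"specification": 1.0, "meeting memo": 0.5}
combined = weighted_cooccurrence(doc_counts, doc_weights)
print(combined)
```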

The present invention makes it possible to support the understanding, creation, and revision of the various documents exchanged in tasks such as requirements definition in software and system development by removing ambiguity from those documents, and it can be applied to uses that improve the efficiency of system development, such as reducing rework and improving customer satisfaction.
In addition, because polysemic words can be extracted accurately, the invention can be used in translation systems to choose between alternative translations.

DESCRIPTION OF SYMBOLS
10 Document input unit
20, 20A Word analysis unit
30 Basic word co-occurrence vector extraction unit
35 Constituent word dominance calculation unit
36 Compound word composition distribution estimation unit
40, 40A Co-occurrence word concept estimation unit
50 Co-occurrence word classification unit
60 Polysemic word candidate estimation unit
70 Polysemic word candidate output unit
100, 100A Polysemic word extraction system
D Document
Y, Ya Document analysis system
Z, Z′ Internet server

Claims (18)

A polysemic word extraction system comprising:
a word analysis unit that extracts each word used in given input text;
a basic word co-occurrence vector extraction unit that selects an arbitrary word from among the words as a basic word and extracts a basic word co-occurrence vector represented by the basic word co-occurrence words regarded as co-occurring with the basic word and their co-occurrence counts;
a co-occurrence word concept estimation unit that estimates, from general concepts, the co-occurrence word concept of each basic word co-occurrence word of the basic word co-occurrence vector;
a co-occurrence word classification unit that, for the estimated group of co-occurrence word concepts, clusters the basic word co-occurrence words of the selected basic word based on the similarity between the corresponding co-occurrence word concepts;
a polysemic word candidate estimation unit that extracts the basic word as a polysemic word candidate when a plurality of clusters exist for the selected basic word; and
a polysemic word candidate output unit that outputs the extracted polysemic word candidate.
A polysemic word extraction system comprising:
a document input unit that receives input of a target document or document group;
a word analysis unit that extracts each word used in the text constituting the document or document group, and extracts word information on the part of speech and case of each word, the particles combined with it, and the dependency relationships between words;
a basic word co-occurrence vector extraction unit that selects arbitrary words from among the words as basic words and, based on the word information of each basic word, extracts for each basic word a basic word co-occurrence vector represented by the basic word co-occurrence words regarded as co-occurring with the basic word under an arbitrary basic word co-occurrence determination rule and their co-occurrence counts;
a concept database that collects and stores general concept information systematizing the general concepts of words, such as concept classifications, synonyms, near-synonyms, and usage, and that, in response to a query about a specific word, retrieves and returns general concept information related to the meaning and usage of the word;
a co-occurrence word concept estimation unit that uses the general concept information of the concept database to estimate, based on a predetermined concept estimation method, the co-occurrence word concept of each basic word co-occurrence word of the basic word co-occurrence vector;
a co-occurrence word classification unit that, for the basic word co-occurrence words of a specific basic word, calculates the similarity between the corresponding co-occurrence word concepts using a predetermined similarity index and clusters the basic word co-occurrence words based on the similarity index between the co-occurrence word concepts;
a polysemic word candidate estimation unit that extracts, as a polysemic word candidate, a basic word for which there exist a plurality of clusters of its basic word co-occurrence words whose size is equal to or greater than an arbitrarily set threshold; and
a polysemic word candidate output unit that outputs the extracted polysemic word candidate.
A polysemic word extraction system comprising:
a word analysis unit that extracts each word used in given input text and, from among the words, extracts compound words and their constituent words;
a constituent word dominance calculation unit that calculates a constituent word dominance for each constituent word;
a compound word composition distribution estimation unit that calculates constituent word weighting coefficients for each compound word using the constituent word dominances;
a basic word co-occurrence vector extraction unit that selects an arbitrary word from among the words as a basic word and extracts a basic word co-occurrence vector represented by the basic word co-occurrence words regarded as co-occurring with the basic word and their co-occurrence counts;
a co-occurrence word concept estimation unit that, for each basic word co-occurrence word that is a compound word among the co-occurrence words of the basic word co-occurrence vector in the basic word co-occurrence table, treats each constituent word as a basic word co-occurrence word, updates the basic word co-occurrence vector by using, as the co-occurrence count of each constituent word, the value obtained by multiplying the co-occurrence count of that basic word co-occurrence word by the constituent word weighting coefficient of the constituent word, and estimates, from general concepts, the co-occurrence word concept of each basic word co-occurrence word of the basic word co-occurrence vector;
a co-occurrence word classification unit that, for the estimated group of co-occurrence word concepts, clusters the basic word co-occurrence words of the selected basic word based on the similarity between the corresponding co-occurrence word concepts;
a polysemic word candidate estimation unit that extracts the basic word as a polysemic word candidate when a plurality of clusters exist for the selected basic word; and
a polysemic word candidate output unit that outputs the extracted polysemic word candidate.
対象とする文書もしくは文書群の入力を受け付ける文書入力部と、
単語の概念分類、同義語、類義語、用法といった単語の一般概念を体系付けた一般概念情報を収集して蓄積し、特定の単語に関する問い合わせに対し、単語の意味や用法に関連する一般概念情報を検索し応答する概念データベースと、
文書もしくは文書群を構成する文章に使用されている各単語の抽出および単語毎の品詞や格、組み合される助詞、単語間の係り受け関係に関する単語情報の抽出を行い、概念データベースに抽出された各単語で一般概念情報の登録が無く、かつ文字数が2文字以上の単語を複合語として抽出し、複合語を構成するあらゆる部分文字列について、一般概念情報の登録がある部分文字列を複合語の有意構成語として抽出し、登録が無い部分文字列を不明構成語として抽出する単語分析部と、
各単語の単語情報、および複合語に基づき、複合語共起判定ルールで複合語と共起する単語を複合語共起語として、複合語毎に複合語共起語とその共起数を抽出し、これらをまとめることで複合語共起表を作成し、前記複合語共起表から同じ構成語を含む部分一致複合語の複合語共起語からなる複合語共起ベクトルを抽出し、構成語別に部分一致複合語共起表を作成し、部分一致複合語共起表の複合語共起ベクトルから得られる共起ベクトル空間における各部分一致複合語間の集約度を構成語支配度として算出する構成語支配度算出部と、
各構成語支配度を使用して複合語毎の各構成語間の構成語重み付け係数を算出し、構成語重み付け係数をまとめた複合語構成配分表を作成する複合語構成配分推定部と、
前記単語の内で任意の単語を基軸単語として選択して、基軸単語毎の単語情報に基づき、任意の基軸単語共起判定ルールで基軸単語と共起関係とみなされる基軸単語共起語とその共起数とで表される基軸単語共起ベクトルをそれぞれ抽出する基軸単語共起ベクトル抽出部と、
基軸単語共起表の基軸単語共起ベクトルの各共起語の内で複合語になっている基軸単語共起語について、各構成語をそれぞれ基軸単語共起語として扱い、複合語構成配分表に基づき、前記基軸単語共起語の共起数に各構成語の構成語重み付け係数を掛けて算出した値を各構成語の共起数として基軸単語共起ベクトルを更新し、概念データベースの一般概念情報を利用し、所定の概念推定方法に基づき、基軸単語共起ベクトルの各基軸単語共起語の共起語概念を推定する共起語概念推定部と、
前記任意の基軸単語に関する各基軸単語共起語について、対応する前記共起語概念間の類似性を所定の類似性指標によって算出し、前記共起語概念間の類似性指標に基づき各基軸単語共起語のクラスタリングをそれぞれ行う共起語分類部と、
各基軸単語に関する各基軸単語共起語の各クラスタの規模が任意に定めた閾値以上のクラスタが複数存在する基軸単語を、多義語候補として抽出する多義語候補推定部と、
抽出した多義語候補を出力する多義語候補出力部と、
を備えたことを特徴とする多義語抽出システム。
A document input unit for receiving input of a target document or document group, and
Collects and accumulates general concept information that organizes general concepts of words such as word concept classification, synonyms, synonyms, and usage, and collects general concept information related to the meaning and usage of words for inquiries about specific words A conceptual database that searches and responds;
Extract each word used in a sentence that composes a document or group of documents, and extract word information related to part-of-speech and case for each word, particle to be combined, and dependency relation between words. A word with no general concept information registered and two or more characters is extracted as a compound word, and a partial character string with general concept information registered is extracted from any partial character string constituting the compound word. A word analysis unit that extracts as a significant constituent word and extracts a partial character string without registration as an unknown constituent word;
Based on the word information of each word and the compound word, the compound word co-occurrence rule is used as a compound word co-occurrence word, and the compound word co-occurrence word and the number of co-occurrence are extracted for each compound word. By combining these, a compound word co-occurrence table is created, and a compound word co-occurrence vector composed of compound word co-occurrence words of partially matching compound words including the same component word is extracted from the compound word co-occurrence table, and configured Create a partially matched compound word co-occurrence table for each word, and calculate the degree of aggregation between each partially matched compound word in the co-occurrence vector space obtained from the compound word co-occurrence vector of the partially matched compound word co-occurrence table as the constituent word dominance A constituent word dominance degree calculation unit,
Calculating, for each compound word, a constituent word weighting coefficient among its constituent words using each constituent word dominance degree, and creating a compound word composition distribution table that summarizes the constituent word weighting coefficients;
A basic word co-occurrence vector extraction unit that selects an arbitrary word among the words as a basic word and, based on the word information for each basic word, extracts a basic word co-occurrence vector represented by the basic word co-occurrence words, which are regarded under an arbitrary basic word co-occurrence determination rule as being in a co-occurrence relationship with the basic word, and their co-occurrence counts;
For each basic word co-occurrence word that is a compound word among the co-occurrence words in the basic word co-occurrence vectors of the basic word co-occurrence table, each constituent word is treated as a basic word co-occurrence word, and the basic word co-occurrence vector is updated using, as the co-occurrence count of each constituent word, the value obtained by multiplying the co-occurrence count of the compound basic word co-occurrence word by the constituent word weighting coefficient of that constituent word in the compound word composition distribution table;
A co-occurrence word concept estimation unit that estimates the co-occurrence word concept of each basic word co-occurrence word of the basic word co-occurrence vector based on a predetermined concept estimation method using concept information;
A co-occurrence word classification unit that, for each basic word co-occurrence word related to an arbitrary basic word, calculates the similarity between the corresponding co-occurrence word concepts using a predetermined similarity index, and clusters the basic word co-occurrence words based on that similarity index;
A polysemic word candidate estimation unit that extracts, as a polysemic word candidate, any basic word for which a plurality of clusters of its basic word co-occurrence words have a size equal to or larger than an arbitrarily set threshold;
A polysemic word candidate output unit that outputs the extracted polysemic word candidates;
A polysemic word extraction system characterized by comprising the units above.
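The update of the basic word co-occurrence vector for compound words can be illustrated with a minimal Python sketch. It assumes that the weighting coefficients are the constituent word dominance degrees divided by their sum, as one of the dependent claims describes. The function names, the sample words, and the equal-weight fallback are hypothetical and do not come from the patent.

```python
from collections import Counter


def normalize_dominance(dominance):
    """Turn constituent word dominance degrees into weights that sum to 1."""
    total = sum(dominance.values())
    if total == 0:
        # Assumption: fall back to equal weights when no dominance is known.
        return {word: 1.0 / len(dominance) for word in dominance}
    return {word: degree / total for word, degree in dominance.items()}


def update_cooccurrence_vector(vector, distribution_table):
    """Replace compound co-occurrence words with their weighted constituents."""
    updated = Counter()
    for word, count in vector.items():
        weights = distribution_table.get(word)
        if weights is None:
            # Not a compound word: keep its co-occurrence count unchanged.
            updated[word] += count
        else:
            # Compound word: split its count among the constituent words.
            for constituent, weight in weights.items():
                updated[constituent] += count * weight
    return dict(updated)


# Hypothetical compound word "order system" with dominance degrees 1 and 3.
table = {"order system": normalize_dominance({"order": 1.0, "system": 3.0})}
vector = {"order system": 4, "system": 2, "design": 1}
updated = update_cooccurrence_vector(vector, table)
```

In this sketch the four co-occurrences of the compound word are split 1 to 3 between its constituents, and the share given to "system" is added to the count that word already had.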
The polysemic word extraction system according to claim 2 or 4, wherein the basic word co-occurrence determination rule in the basic word co-occurrence vector extraction unit is a rule that regards a word in a dependency relationship with the basic word as a co-occurrence word, or a rule that regards a word used with a specific particle in the same sentence as the basic word as a co-occurrence word.
The polysemic word extraction system according to claim 2 or 4, wherein the concept database is a thesaurus that stores words organized into a classification system and from which synonym, near-synonym, hypernym/hyponym, and part/whole relationships between words can be obtained as general concept information.
The polysemic word extraction system according to claim 6, wherein the concept estimation method of the co-occurrence word concept estimation unit queries the concept database for general concept information on each basic word co-occurrence word and uses, as the co-occurrence word concept, a basic word co-occurrence concept vector in which all basic word co-occurrence words of a specific basic word are replaced with their general concept information concepts, and the co-occurrence word classification unit performs clustering using, as the similarity index, the depth of classification at which all basic word co-occurrence words come to be regarded as the same general concept information concept.
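One possible reading of the classification-depth similarity index is sketched below in Python. It assumes that the thesaurus gives each co-occurrence word a category path from the root to a leaf, and it measures similarity as the number of leading levels two paths share. The paths, the words, and the single-linkage grouping are illustrative assumptions, not the patent's specified procedure.

```python
def shared_depth(path_a, path_b):
    """Number of leading thesaurus levels that two category paths share."""
    depth = 0
    for a, b in zip(path_a, path_b):
        if a != b:
            break
        depth += 1
    return depth


def cluster_by_depth(paths, min_depth):
    """Group words whose paths share at least min_depth levels (single linkage)."""
    clusters = []
    for word in paths:
        # Existing clusters that contain at least one sufficiently similar word.
        linked = [
            c for c in clusters
            if any(shared_depth(paths[word], paths[other]) >= min_depth for other in c)
        ]
        merged = {word}
        for c in linked:
            merged |= c
            clusters.remove(c)
        clusters.append(merged)
    return clusters


# Hypothetical category paths for words co-occurring with the basic word "bank".
paths = {
    "withdraw": ("action", "finance", "banking"),
    "deposit": ("action", "finance", "banking"),
    "river": ("nature", "water", "stream"),
    "shore": ("nature", "water", "coast"),
}
clusters = cluster_by_depth(paths, min_depth=2)
```

With these sample paths the co-occurrence words fall into two groups, a finance group and a water group, which is the situation in which the basic word would be flagged as a polysemic word candidate.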
The polysemic word extraction system according to claim 6, wherein the concept estimation method of the co-occurrence word concept estimation unit creates a peripheral word composition table that collects, for all basic word co-occurrence words, peripheral word composition vectors based on the peripheral words found around each basic word co-occurrence word under an arbitrary peripheral word determination rule and their occurrence counts; queries the concept database for general concept information on each peripheral word in the peripheral word composition vectors of the table; creates, for each corresponding basic word co-occurrence word, a peripheral word concept vector in which each peripheral word of each peripheral word composition vector is converted, within an arbitrary range, into a general concept; and uses, as the co-occurrence word concept, a basic word co-occurrence concept table that collects the peripheral word concept vectors corresponding to all basic word co-occurrence words of a specific basic word, and
the co-occurrence word classification unit calculates, for each hierarchy level, the distance between the peripheral word concept vectors corresponding to the basic word co-occurrence words, and performs clustering using, as the similarity index, the value of a function that decreases monotonically with a distance weighted so that distances at more detailed classification levels count more heavily.
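The hierarchy-weighted similarity index can be sketched as follows. The claim only requires a per-level distance, a weighting that favors more detailed levels, and a monotonically decreasing function. The Euclidean distance, the weights 1.0 and 2.0, the function 1 / (1 + d), and the sample vectors are all assumptions chosen for illustration.

```python
import math


def level_distance(vec_a, vec_b):
    """Euclidean distance between two sparse concept-count vectors."""
    keys = set(vec_a) | set(vec_b)
    return math.sqrt(sum((vec_a.get(k, 0) - vec_b.get(k, 0)) ** 2 for k in keys))


def similarity(levels_a, levels_b, level_weights):
    """Similarity that decreases monotonically with the weighted distance."""
    distance = sum(
        w * level_distance(a, b)
        for w, a, b in zip(level_weights, levels_a, levels_b)
    )
    return 1.0 / (1.0 + distance)


# One peripheral word concept vector per hierarchy level, coarse level first.
deposit = [{"activity": 3}, {"finance": 3}]
withdraw = [{"activity": 2}, {"finance": 2}]
fishing = [{"activity": 3}, {"leisure": 3}]
level_weights = [1.0, 2.0]  # the more detailed level counts more heavily

sim_close = similarity(deposit, withdraw, level_weights)
sim_far = similarity(deposit, fishing, level_weights)
```

Here "deposit" and "fishing" agree at the coarse level but differ at the detailed level, so the heavier fine-level weight pushes their similarity below that of "deposit" and "withdraw".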
The polysemic word extraction system according to claim 8, wherein the arbitrary peripheral word determination rule in the concept estimation method of the co-occurrence word concept estimation unit includes an algorithm that changes the range regarded as peripheral for each part of speech, for example verbs coexisting in the same sentence and nouns in the text under the same table-of-contents item.
The polysemic word extraction system according to any one of claims 4 to 9, wherein the compound word co-occurrence determination rule of the constituent word dominance calculation unit includes an algorithm that extracts compound word co-occurrence words and calculates compound word co-occurrence counts while changing the range and conditions regarded as co-occurrence for each part of speech, for example words in a dependency relationship when the part of speech is a verb and words in the same paragraph when it is a noun.
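A rule that changes the peripheral range by part of speech might look like the sketch below. It follows the example given for the peripheral word determination rule, counting verbs from the same sentence and nouns from the same table-of-contents item. The `Token` structure, its field names, and the sample tokens are hypothetical.

```python
from collections import Counter
from typing import NamedTuple


class Token(NamedTuple):
    surface: str   # word as written
    pos: str       # part of speech, "verb" or "noun" in this sketch
    sentence: int  # index of the sentence containing the token
    section: int   # index of the table-of-contents item containing the token


def peripheral_words(tokens, target):
    """Count peripheral words of target, with a range that depends on part of speech."""
    counts = Counter()
    for tok in tokens:
        if tok is target:
            continue
        if tok.pos == "verb" and tok.sentence == target.sentence:
            counts[tok.surface] += 1  # verbs: same sentence only
        elif tok.pos == "noun" and tok.section == target.section:
            counts[tok.surface] += 1  # nouns: anywhere in the same item
    return counts


tokens = [
    Token("server", "noun", 0, 0),
    Token("restart", "verb", 0, 0),
    Token("rack", "noun", 1, 0),
    Token("install", "verb", 1, 0),
    Token("menu", "noun", 2, 1),
]
result = peripheral_words(tokens, tokens[0])
```

The verb "install" is excluded because it sits in a different sentence, while the noun "rack" is included because it shares the same table-of-contents item as the target.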
The polysemic word extraction system according to any one of claims 4 to 10, wherein the degree of aggregation among partially matching compound words in the constituent word dominance calculation unit is calculated, as an index expressing how small the scatter is among the vectors corresponding to the partially matching compound words, by a function that decreases monotonically with an index of dispersion.
The polysemic word extraction system according to any one of claims 4 to 11, wherein the degree of aggregation among partially matching compound words in the constituent word dominance calculation unit is calculated based on a vector space weighted by the part of speech of the co-occurrence words.
The polysemic word extraction system according to any one of claims 4 to 12, wherein the compound word composition distribution estimation unit calculates a normalized weighting coefficient by dividing the constituent word dominance degree of each constituent word of a compound word by the sum of the constituent word dominance degrees for that compound word.
The polysemic word extraction system according to any one of claims 1 to 13, wherein a weighting coefficient is assigned to each document or each group of sentences to be analyzed and is used to estimate, from the general concept, a reliable co-occurrence word concept for each basic word co-occurrence word of an arbitrary basic word, clusters are formed using the estimated co-occurrence word concepts, and it is determined whether the basic word is to be a polysemic word candidate.
A polysemic word extraction method characterized by:
Extracting each word used in a given input text;
Selecting an arbitrary word among the extracted words as a basic word, and extracting a basic word co-occurrence vector represented by the basic word co-occurrence words regarded as being in a co-occurrence relationship with the basic word and their co-occurrence counts;
Estimating, from a general concept, the co-occurrence word concept of each basic word co-occurrence word of the extracted basic word co-occurrence vector;
Clustering, for the estimated group of co-occurrence word concepts, the basic word co-occurrence words of the selected basic word based on the similarity between the corresponding co-occurrence word concepts; and
Extracting the basic word as a polysemic word candidate when a plurality of clusters exist for the selected basic word.
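The method as a whole can be sketched end to end in Python. The sketch makes several simplifying assumptions that are not in the claim: sentences arrive already tokenized, two words co-occur when they appear in the same sentence, the general concept of a word comes from a plain dictionary lookup, and two co-occurrence words belong to the same cluster only when they map to the same concept.

```python
from collections import Counter, defaultdict


def cooccurrence_vector(sentences, basic_word):
    """Count the words that share a sentence with basic_word."""
    vector = Counter()
    for tokens in sentences:
        if basic_word in tokens:
            vector.update(t for t in tokens if t != basic_word)
    return vector


def polysemic_candidates(sentences, concept_of):
    """Return basic words whose co-occurrence words form more than one cluster."""
    vocabulary = {t for tokens in sentences for t in tokens}
    candidates = []
    for basic_word in sorted(vocabulary):
        vector = cooccurrence_vector(sentences, basic_word)
        clusters = defaultdict(set)
        for word in vector:
            concept = concept_of.get(word)  # hypothetical concept lookup
            if concept is not None:
                clusters[concept].add(word)
        if len(clusters) > 1:
            candidates.append(basic_word)
    return candidates


sentences = [
    ["server", "hardware", "rack"],
    ["server", "restaurant", "waiter"],
    ["rack", "hardware"],
]
concept_of = {
    "hardware": "equipment",
    "rack": "equipment",
    "restaurant": "dining",
    "waiter": "dining",
}
candidates = polysemic_candidates(sentences, concept_of)
```

Only "server" co-occurs with words from two different concepts in this toy input, so it is the only basic word reported.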
A polysemic word extraction method characterized by:
Extracting each word used in the sentences constituting a document or document group received from an input unit, and extracting word information on each word's part of speech and case, the particles combined with it, and the dependency relationships between words;
Selecting an arbitrary word among the words as a basic word and, based on the word information for each basic word, extracting for each a basic word co-occurrence vector represented by the basic word co-occurrence words regarded under an arbitrary basic word co-occurrence determination rule as being in a co-occurrence relationship with the basic word and their co-occurrence counts;
Estimating the co-occurrence word concept of each basic word co-occurrence word of the basic word co-occurrence vector based on a predetermined concept estimation method, using general concept information obtained as a response from a concept database that collects and accumulates general concept information systematizing general concepts of words such as concept classifications, synonyms, near-synonyms, and usage, and that, in response to a query about a specific word, retrieves and returns general concept information related to the meaning and usage of the word;
Calculating, for each basic word co-occurrence word of a specific basic word, the similarity between the corresponding co-occurrence word concepts using a predetermined similarity index, and clustering the basic word co-occurrence words based on that similarity index; and
Extracting, as a polysemic word candidate, any basic word for which a plurality of clusters of its basic word co-occurrence words have a size equal to or larger than an arbitrarily set threshold.
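The final thresholding step can be sketched as a short Python check. The claim does not say how the size of a cluster is measured. This sketch assumes it is the total co-occurrence count of the words in the cluster, and the function names and sample data are hypothetical.

```python
def cluster_size(cluster, counts):
    """Size of a cluster, assumed here to be its total co-occurrence count."""
    return sum(counts.get(word, 0) for word in cluster)


def is_polysemic_candidate(clusters, counts, threshold):
    """True when at least two clusters reach the size threshold."""
    large = [c for c in clusters if cluster_size(c, counts) >= threshold]
    return len(large) >= 2


counts = {"hardware": 5, "rack": 3, "waiter": 1}
clusters = [{"hardware", "rack"}, {"waiter"}]

strict = is_polysemic_candidate(clusters, counts, threshold=2)
loose = is_polysemic_candidate(clusters, counts, threshold=1)
```

Raising the threshold filters out small clusters that may come from incidental co-occurrences, at the cost of missing rarely used secondary meanings.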
A program characterized by causing the control unit of an information processing device to operate as:
A word analysis unit that extracts each word used in a given input text;
A basic word co-occurrence vector extraction unit that selects an arbitrary word among the words as a basic word and extracts a basic word co-occurrence vector represented by the basic word co-occurrence words regarded as being in a co-occurrence relationship with the basic word and their co-occurrence counts;
A co-occurrence word concept estimation unit that estimates, from a general concept, the co-occurrence word concept of each basic word co-occurrence word of the basic word co-occurrence vector;
A co-occurrence word classification unit that, for the estimated group of co-occurrence word concepts, clusters the basic word co-occurrence words of the selected basic word based on the similarity between the corresponding co-occurrence word concepts;
A polysemic word candidate estimation unit that extracts the basic word as a polysemic word candidate when a plurality of clusters exist for the selected basic word; and
A polysemic word candidate output unit that outputs the extracted polysemic word candidates.
A program characterized by causing the control unit of an information processing device to operate as:
A document input unit that receives input of a target document or document group;
A word analysis unit that extracts each word used in the sentences constituting the document or document group, and extracts word information on each word's part of speech and case, the particles combined with it, and the dependency relationships between words;
A basic word co-occurrence vector extraction unit that selects an arbitrary word among the words as a basic word and, based on the word information for each basic word, extracts for each a basic word co-occurrence vector represented by the basic word co-occurrence words regarded under an arbitrary basic word co-occurrence determination rule as being in a co-occurrence relationship with the basic word and their co-occurrence counts;
A co-occurrence word concept estimation unit that estimates the co-occurrence word concept of each basic word co-occurrence word of the basic word co-occurrence vector based on a predetermined concept estimation method, using general concept information obtained as a response from a concept database that collects and accumulates general concept information systematizing general concepts of words such as concept classifications, synonyms, near-synonyms, and usage, and that, in response to a query about a specific word, retrieves and returns general concept information related to the meaning and usage of the word;
A co-occurrence word classification unit that, for each basic word co-occurrence word of a specific basic word, calculates the similarity between the corresponding co-occurrence word concepts using a predetermined similarity index, and clusters the basic word co-occurrence words based on that similarity index;
A polysemic word candidate estimation unit that extracts, as a polysemic word candidate, any basic word for which a plurality of clusters of its basic word co-occurrence words have a size equal to or larger than an arbitrarily set threshold; and
A polysemic word candidate output unit that outputs the extracted polysemic word candidates.
JP2011152983A 2011-07-11 2011-07-11 Polysemy extraction system, polysemy extraction method, and program Active JP5754018B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2011152983A JP5754018B2 (en) 2011-07-11 2011-07-11 Polysemy extraction system, polysemy extraction method, and program

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP2011152983A JP5754018B2 (en) 2011-07-11 2011-07-11 Polysemy extraction system, polysemy extraction method, and program

Publications (2)

Publication Number Publication Date
JP2013020431A true JP2013020431A (en) 2013-01-31
JP5754018B2 JP5754018B2 (en) 2015-07-22

Family

ID=47691808

Family Applications (1)

Application Number Title Priority Date Filing Date
JP2011152983A Active JP5754018B2 (en) 2011-07-11 2011-07-11 Polysemy extraction system, polysemy extraction method, and program

Country Status (1)

Country Link
JP (1) JP5754018B2 (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101478016B1 (en) * 2013-09-04 2015-01-02 한국과학기술정보연구원 Apparatus and method for information retrieval based on sentence cluster using term co-occurrence
CN106776562A (en) * 2016-12-20 2017-05-31 上海智臻智能网络科技股份有限公司 A kind of keyword extracting method and extraction system
WO2020017006A1 (en) 2018-07-19 2020-01-23 富士通株式会社 Learning method, translation method, learning program, translation program, and information processing device
WO2020021609A1 (en) 2018-07-23 2020-01-30 富士通株式会社 Generation method, generation program, and information processing apparatus
US10643152B2 (en) 2017-03-30 2020-05-05 Fujitsu Limited Learning apparatus and learning method
WO2021098794A1 (en) * 2019-11-21 2021-05-27 邝俊伟 Text search method, device, server, and storage medium
US11144724B2 (en) 2018-03-14 2021-10-12 Fujitsu Limited Clustering of words with multiple meanings based on generating vectors for each meaning
US20220138417A1 (en) * 2019-02-21 2022-05-05 Nippon Telegraph And Telephone Corporation Synonym extraction device, synonym extraction method, and synonym extraction program
WO2022130578A1 (en) * 2020-12-17 2022-06-23 富士通株式会社 Similarity determination program, similarity determination device, and similarity determination method
WO2022130579A1 (en) * 2020-12-17 2022-06-23 富士通株式会社 Similarity determination program, similarity determination device, and similarity determination method
US11514248B2 (en) 2017-06-30 2022-11-29 Fujitsu Limited Non-transitory computer readable recording medium, semantic vector generation method, and semantic vector generation device

Families Citing this family (2)

Publication number Priority date Publication date Assignee Title
CN105373528B (en) * 2015-08-18 2019-03-12 新华网股份有限公司 A kind of text content sensitive analysis method and device
CN106909537B (en) * 2017-02-07 2020-04-07 中山大学 One-word polysemous analysis method based on topic model and vector space

Citations (5)

Publication number Priority date Publication date Assignee Title
US6256629B1 (en) * 1998-11-25 2001-07-03 Lucent Technologies Inc. Method and apparatus for measuring the degree of polysemy in polysemous words
JP2005025555A (en) * 2003-07-03 2005-01-27 Ricoh Co Ltd Thesaurus construction system, thesaurus construction method, program for executing the method, and storage medium with the program stored thereon
US20090157648A1 (en) * 2007-12-14 2009-06-18 Richard Michael King Method and Apparatus for Discovering and Classifying Polysemous Word Instances in Web Documents
JP2010182267A (en) * 2009-02-09 2010-08-19 Toshiba Corp Content classification apparatus, content classification method, and program
JP2010225135A (en) * 2009-03-20 2010-10-07 Nec (China) Co Ltd Disambiguation method and system

Non-Patent Citations (4)

Title
CSNG201100237118; 鏑木 雄太 外2名: '"共起語グラフのクラスタリングによる単語の多義性抽出"' 言語処理学会第17回年次大会発表論文集 チュートリアル 本会議 ワークショップ [CD-ROM] , 20110307, p.508-511, 言語処理学会 *
CSNJ200810054156; 片岡 浩一 外1名: '"未知語動詞の語義推定における多義性の検出"' 第62回(平成13年前期)全国大会講演論文集(2) 人工知能と認知科学 , 20010313, p.2-321〜2-322, 社団法人情報処理学会 *
JPN6015001849; 鏑木 雄太 外2名: '"共起語グラフのクラスタリングによる単語の多義性抽出"' 言語処理学会第17回年次大会発表論文集 チュートリアル 本会議 ワークショップ [CD-ROM] , 20110307, p.508-511, 言語処理学会 *
JPN6015001850; 片岡 浩一 外1名: '"未知語動詞の語義推定における多義性の検出"' 第62回(平成13年前期)全国大会講演論文集(2) 人工知能と認知科学 , 20010313, p.2-321〜2-322, 社団法人情報処理学会 *

Cited By (12)

Publication number Priority date Publication date Assignee Title
KR101478016B1 (en) * 2013-09-04 2015-01-02 한국과학기술정보연구원 Apparatus and method for information retrieval based on sentence cluster using term co-occurrence
CN106776562A (en) * 2016-12-20 2017-05-31 上海智臻智能网络科技股份有限公司 A kind of keyword extracting method and extraction system
US10643152B2 (en) 2017-03-30 2020-05-05 Fujitsu Limited Learning apparatus and learning method
US11514248B2 (en) 2017-06-30 2022-11-29 Fujitsu Limited Non-transitory computer readable recording medium, semantic vector generation method, and semantic vector generation device
US11144724B2 (en) 2018-03-14 2021-10-12 Fujitsu Limited Clustering of words with multiple meanings based on generating vectors for each meaning
WO2020017006A1 (en) 2018-07-19 2020-01-23 富士通株式会社 Learning method, translation method, learning program, translation program, and information processing device
WO2020021609A1 (en) 2018-07-23 2020-01-30 富士通株式会社 Generation method, generation program, and information processing apparatus
US20220138417A1 (en) * 2019-02-21 2022-05-05 Nippon Telegraph And Telephone Corporation Synonym extraction device, synonym extraction method, and synonym extraction program
US11900055B2 (en) * 2019-02-21 2024-02-13 Nippon Telegraph And Telephone Corporation Synonym extraction device, synonym extraction method, and synonym extraction program
WO2021098794A1 (en) * 2019-11-21 2021-05-27 邝俊伟 Text search method, device, server, and storage medium
WO2022130578A1 (en) * 2020-12-17 2022-06-23 富士通株式会社 Similarity determination program, similarity determination device, and similarity determination method
WO2022130579A1 (en) * 2020-12-17 2022-06-23 富士通株式会社 Similarity determination program, similarity determination device, and similarity determination method

Also Published As

Publication number Publication date
JP5754018B2 (en) 2015-07-22

Similar Documents

Publication Publication Date Title
JP5754018B2 (en) Polysemy extraction system, polysemy extraction method, and program
CN109284357B (en) Man-machine conversation method, device, electronic equipment and computer readable medium
Gambhir et al. Recent automatic text summarization techniques: a survey
JP5754019B2 (en) Synonym extraction system, method and program
CN106649260B (en) Product characteristic structure tree construction method based on comment text mining
Biemann et al. Text: Now in 2D! a framework for lexical expansion with contextual similarity
JP6187877B2 (en) Synonym extraction system, method and recording medium
JP2011227688A (en) Method and device for extracting relation between two entities in text corpus
JP5057474B2 (en) Method and system for calculating competition index between objects
Sarwadnya et al. Marathi extractive text summarizer using graph based model
JPWO2014002775A1 (en) Synonym extraction system, method and recording medium
Ojokoh et al. A feature-opinion extraction approach to opinion mining
JP4534666B2 (en) Text sentence search device and text sentence search program
CN113157860B (en) Electric power equipment maintenance knowledge graph construction method based on small-scale data
Litvak et al. Cross-lingual training of summarization systems using annotated corpora in a foreign language
JP2006338342A (en) Word vector generation device, word vector generation method and program
JP6108212B2 (en) Synonym extraction system, method and program
Anam et al. Review of ontology matching approaches and challenges
WO2014002774A1 (en) Synonym extraction system, method, and recording medium
JP7110554B2 (en) Ontology generation device, ontology generation program and ontology generation method
CN110020436A (en) A kind of microblog emotional analytic approach of ontology and the interdependent combination of syntax
JP5720071B2 (en) Compound word concept analysis system, method and program
JP2004272352A (en) Similarity calculation method, similarity calculation device, similarity calculation program, and recording medium stored with the program
Kutuzov et al. Neural embedding language models in semantic clustering of web search results
Wang et al. A semantic path based approach to match subgraphs from large financial knowledge graph

Legal Events

Date Code Title Description
A621 Written request for application examination

Free format text: JAPANESE INTERMEDIATE CODE: A621

Effective date: 20140709

A521 Request for written amendment filed

Free format text: JAPANESE INTERMEDIATE CODE: A821

Effective date: 20140709

A977 Report on retrieval

Free format text: JAPANESE INTERMEDIATE CODE: A971007

Effective date: 20141216

A131 Notification of reasons for refusal

Free format text: JAPANESE INTERMEDIATE CODE: A131

Effective date: 20150121

A521 Request for written amendment filed

Free format text: JAPANESE INTERMEDIATE CODE: A523

Effective date: 20150311

TRDD Decision of grant or rejection written
A01 Written decision to grant a patent or to grant a registration (utility model)

Free format text: JAPANESE INTERMEDIATE CODE: A01

Effective date: 20150408

A61 First payment of annual fees (during grant procedure)

Free format text: JAPANESE INTERMEDIATE CODE: A61

Effective date: 20150428

R150 Certificate of patent or registration of utility model

Ref document number: 5754018

Country of ref document: JP

Free format text: JAPANESE INTERMEDIATE CODE: R150

R250 Receipt of annual fees

Free format text: JAPANESE INTERMEDIATE CODE: R250

R250 Receipt of annual fees

Free format text: JAPANESE INTERMEDIATE CODE: R250

R250 Receipt of annual fees

Free format text: JAPANESE INTERMEDIATE CODE: R250

S533 Written request for registration of change of name

Free format text: JAPANESE INTERMEDIATE CODE: R313533

R350 Written notification of registration of transfer

Free format text: JAPANESE INTERMEDIATE CODE: R350

R250 Receipt of annual fees

Free format text: JAPANESE INTERMEDIATE CODE: R250

R250 Receipt of annual fees

Free format text: JAPANESE INTERMEDIATE CODE: R250

R250 Receipt of annual fees

Free format text: JAPANESE INTERMEDIATE CODE: R250