JPWO2014002775A1 - Synonym extraction system, method and recording medium

Info

Publication number: JPWO2014002775A1
Application number: JP2014522531A
Authority: JP
Inventors: 智久五藤; 英司平尾; 古橋　武; 大弘吉川
Original assignee: Nagoya University NUC; NEC Corp; Tokai National Higher Education and Research System NUC
Current assignee: Nagoya University NUC; NEC Corp; Tokai National Higher Education and Research System NUC
Priority date: 2012-06-25
Filing date: 2013-06-06
Publication date: 2016-05-30
Also published as: WO2014002775A1

Abstract

情報システム構築に関する提案書や仕様書等というような、特定の案件に関する文書群でのみ成り立つ同義語のある文書の曖昧さを改善するために、同義語抽出システムは、分析対象である文書の入力を受け付ける文書入力部と、文書中の各文および複合語に形態素解析および構文解析を適用し、各単語の品詞や係り受け関係を抽出する単語分析部と、各単語の共起語による共起語ベクトルを抽出する共起情報抽出部と、共起語ベクトルに追加すべき攪乱値を、単語ペアの出現数あるいは出現頻度、文書量、文書中の単語のバリエーション数の少なくとも一つから算出する攪乱値算出部と、攪乱値を用いた補正コサイン類似度に基づき、同義関係を持つ単語ペア候補を同義語として推定する同義候補推定部と、推定した同義候補語を提示する同義候補出力部と、を少なくとも具備する。
To reduce the ambiguity of documents containing synonyms that hold only within a document group for a specific project, such as proposals and specifications for information system construction, the synonym extraction system comprises at least: a document input unit that accepts input of a document to be analyzed; a word analysis unit that applies morphological analysis and syntactic analysis to each sentence and compound word in the document and extracts the part of speech and dependency relations of each word; a co-occurrence information extraction unit that extracts, for each word, a co-occurrence word vector built from its co-occurrence words; a disturbance value calculation unit that calculates a disturbance value to be added to the co-occurrence word vector from at least one of the number or frequency of occurrences of a word pair, the document volume, and the number of word variations in the document; a synonym candidate estimation unit that, based on a corrected cosine similarity using the disturbance value, estimates word pair candidates having a synonymous relationship as synonyms; and a synonym candidate output unit that presents the estimated synonym candidates.

Description

本発明は、同義語抽出システム、方法および記録媒体に関し、特に、情報システム構築に関する提案書や仕様書等といった、特定の案件に関する文書群でのみ成り立つ同義語のある文書から、同義語を抽出する同義語抽出システム、方法および記録媒体に関する。
The present invention relates to a synonym extraction system, method, and recording medium, and in particular to a synonym extraction system, method, and recording medium that extract synonyms from documents containing synonyms that hold only within a document group for a specific project, such as proposals and specifications for information system construction.

システムやソフトウェアを構築する際の上流工程の開発文書には、顧客からの提案依頼書（ＲｅｑｕｅｓｔＦｏｒＰｒｏｐｏｓａｌ：ＲＦＰ）、顧客への提案書、顧客と合意すべき要件定義書、および各種仕様書（基本仕様書、機能仕様書、詳細仕様書等）がある。これらの開発文書は、下流工程で行うプログラム実装の設計書といえる。
これらの上流工程の開発文書の誤りは下流のプログラムで拡散していく。拡散した全ての誤りを他のプログラムに影響を与えずに修正するためには、多大な工数が必要となる。この上流工程の開発文書中の誤りの一つとして同義語がある。尚、同義語とは、意義は同じで語形が異なっている語、換言すれば、発音や表記は異なるが、意味の同じである語をいう。
この同義語を検出する方法としては、プロジェクト全体を理解しているプロジェクトマネージャーによるインスペクション（レビュー）が有効であるが、人的リソースが限られている場合は、その運用は困難といえる。一方、この問題点を、ツールを用いて支援しようという開示技術が報告されている。
同義語抽出システムに関する先行技術の一例が、特許文献１に「類似表現抽出装置」として記載されている。この特許文献１に開示された類似表現抽出装置は、データ記憶部、単語グループ記憶部、シソーラス記憶部、文書入力部、単語グループ作成処理部、および評価調整処理部から構成されている。このような構成を有する類似表現抽出装置は、次のように動作する。
すなわち、文書入力部は、入力インタフェースとして電子文書の入力を受け付ける。
単語グループ作成処理部は、次に述べるような処理を行う。まず、単語グループ作成処理部は、前記文書入力部で入力された電子文書内の文を形態素解析し、得られた形態素解析結果を前記データ記憶部に書き込む。そして、単語グループ作成処理部は、前記データ記憶部内の形態素解析結果を構文解析し、構文解析結果として得られた文脈情報を前記データ記憶部に書き込む。単語グループ作成処理部は、前記データ記憶部内の文脈情報から２文節の係り受けの組を含む共起表現を抽出し、この共起表現を前記データ記憶部に書き込む。単語グループ作成処理部は、前記データ記憶部内の共起表現のうち、所定の品詞の組合せの２文節からなる共起表現に基づいて、この共起表現における一方の単語毎に、他方の単語との共起頻度と、前記電子文書内の単語との共起頻度とからなる単語属性値を算出する。単語グループ作成処理部は、前記単語属性値を前記一方の単語に関連付けることにより、当該単語毎に単語ベクトルを作成し、この単語ベクトルを前記データ記憶部に書き込む。単語グループ作成処理部は、前記データ記憶部内の各単語ベクトル間の単語類似度を計算し、得られた単語類似度を、当該計算に用いた各単語ベクトルに関連付けて前記データ記憶部に書き込む。単語グループ作成処理部は、前記データ記憶部内の単語類似度に基づいて、教師なし学習手法により、前記単語類似度の算出に用いた各単語ベクトルが示す各単語を同一の単語グループに分類し、当該分類された各単語を含む単語グループを前記単語グループ記憶部に書き込む。
さらに、評価調整処理部は、次に述べるような処理を行う。まず、評価調整処理部は、前記シソーラス記憶部内のシソーラス情報に含まれる表現のうち、前記入力された電子文書に含まれる表現を学習データとして生成する。そして、評価調整処理部は、前記生成された学習データに基づいて当該学習データ間の類似度を計算し、この類似度により学習データを含む学習データグループを作成する。評価調整処理部は、前記学習データグループの個数に対し、前記単語グループ記憶部内の単語グループの個数を一致させるように、当該単語グループを統合する。評価調整処理部は、前記統合された単語グループ毎に、前記学習データグループ内の学習データを含む度合を示す大域評価値を計算し、この大域評価値を前記データ記憶部に書き込む。評価調整処理部は、前記統合された単語グループ毎に、単語グループ内の各単語を示す各単語ベクトルに関連する単語類似度の分散を計算し、得られた分散を局所評価値として前記データ記憶部に書き込む。評価調整処理部は、前記大域評価値及び前記局所評価値に基づいて、これら両評価値の和を上限値にするように、前記データ記憶部内の単語グループの境界を調整する。評価調整処理部は、前記調整された単語グループ内の各単語を前記類似表現として抽出し、当該抽出した類似表現の各単語を出力する。
なお、データ記憶部は、単語グループ作成処理部、評価調整処理部から読出／書込可能な記憶装置であり、処理前後のデータ等が適宜記憶される。単語グループ記憶部は、単語グループ作成処理部、評価調整処理部から読出／書込可能な記憶装置であり、類似表現の各単語からなる単語グループが記憶される。シソーラス記憶部は、評価調整処理部から読出／書込可能な記憶装置であり、予めシソーラス情報が記憶されている。
このような構成により、文書中の単語について、共起の頻度による単語類似度に基づく単語グループと、シソーラスでの距離などに基づく学習データグループを作成し、学習データグループの個数と構成単語に単語グループの個数および構成単語を一致させるように単語グループの境界を調整することで類似表現の各単語を抽出している。
一方、同義語抽出システムに関する先行技術の他の例が、特許文献２に「辞書生成装置」として記載されている。この特許文献２に開示された辞書生成装置は、入力部、単語分割部、共起頻度表生成部、シソーラス頻度表変換部、頻度表統合部、および関連性学習部から構成されている。このような構成を有する辞書生成装置は、次のように動作する。
入力部は学習用の文書の入力を受け付ける。次に、単語分割部は、入力した文書中のテキストを単語に分割する。さらに、共起頻度表生成部は、文書中の所定の範囲内に出現する単語の頻度統計を収集する。シソーラス頻度表変換部は、辞書の類義関連性をカスタマイズするためのシソーラス情報を仮想的な頻度表に変換する。頻度表統合部は、上記共起頻度表と仮想頻度表を統合する。関連性学習部は、共起頻度表をもとに単語間の関連性を学習し、共起頻度表を圧縮して概念辞書を作成する。
このような構成により、辞書の類義関連性をカスタマイズするためのシソーラス情報を仮想的な頻度表に変換することにより、共起頻度表に存在しない単語の頻度情報を補完し、関連性学習処理を行うことで、元の単語量での共起頻度表では取得できなかった潜在的な関連性を辞書に取り込むことを実現している。
Development documents for the upstream process of building systems and software include requests for proposal (RFPs) from customers, proposals to customers, requirement definition documents to be agreed with customers, and various specifications (basic specifications, functional specifications, detailed specifications, and so on). These development documents serve as the design documents for the program implementation carried out in the downstream process.
Errors in these upstream development documents propagate into the downstream programs. Correcting every propagated error without affecting other programs requires a great many man-hours. One such error in upstream development documents is the use of synonyms. Here, synonyms are words that have the same meaning but different word forms, in other words, words that differ in pronunciation or notation but mean the same thing.
Inspection (review) by a project manager who understands the entire project is an effective way to detect such synonyms, but it is difficult to put into practice when human resources are limited. Techniques that address this problem with tool support have therefore been disclosed.
An example of prior art related to a synonym extraction system is described in Patent Document 1 as a “similar expression extraction device”. The similar expression extraction device disclosed in Patent Document 1 includes a data storage unit, a word group storage unit, a thesaurus storage unit, a document input unit, a word group creation processing unit, and an evaluation adjustment processing unit. The similar expression extraction device having such a configuration operates as follows.
That is, the document input unit accepts input of an electronic document as an input interface.
The word group creation processing unit performs the following processing. First, it performs morphological analysis on the sentences in the electronic document received by the document input unit and writes the morphological analysis results to the data storage unit. It then parses those results and writes the context information obtained from the syntactic analysis to the data storage unit. From this context information it extracts co-occurrence expressions, each containing a pair of two phrases in a dependency relation, and writes them to the data storage unit. Next, using those co-occurrence expressions whose two phrases have a predetermined combination of parts of speech, it calculates, for each word on one side of a co-occurrence expression, word attribute values consisting of its co-occurrence frequency with the word on the other side and its co-occurrence frequency with words in the electronic document. By associating the word attribute values with that word, it creates a word vector for each word and writes the word vectors to the data storage unit. It calculates the word similarity between the word vectors and writes each similarity to the data storage unit in association with the word vectors used in the calculation. Finally, based on the word similarities, it uses an unsupervised learning method to classify the words indicated by those word vectors into word groups, and writes each resulting word group to the word group storage unit.
The evaluation adjustment processing unit performs the following processing. First, from the expressions contained in the thesaurus information in the thesaurus storage unit, it generates as learning data those expressions that also appear in the input electronic document. It calculates the similarity between the items of learning data and, based on this similarity, creates learning data groups. It merges the word groups in the word group storage unit so that their number matches the number of learning data groups. For each merged word group, it calculates a global evaluation value indicating the degree to which the group contains the learning data of a learning data group, and writes this value to the data storage unit. For each merged word group, it also calculates the variance of the word similarities associated with the word vectors of the words in the group, and writes this variance to the data storage unit as a local evaluation value. Based on the global and local evaluation values, it adjusts the boundaries of the word groups in the data storage unit so that the sum of the two evaluation values is maximized. It then extracts the words in each adjusted word group as similar expressions and outputs them.
The data storage unit is a storage device that the word group creation processing unit and the evaluation adjustment processing unit can read from and write to; it stores data before and after processing as needed. The word group storage unit is a storage device that the same two units can read from and write to; it stores word groups made up of the words of similar expressions. The thesaurus storage unit is a storage device that the evaluation adjustment processing unit can read from and write to; thesaurus information is stored in it in advance.
With this configuration, for the words in a document, word groups based on word similarity derived from co-occurrence frequency and learning data groups based on, for example, distance in the thesaurus are created. The words of similar expressions are then extracted by adjusting the word group boundaries so that the number and constituent words of the word groups match those of the learning data groups.
Another example of prior art relating to a synonym extraction system is described in Patent Document 2 as a "dictionary generation device". The dictionary generation device disclosed in Patent Document 2 includes an input unit, a word division unit, a co-occurrence frequency table generation unit, a thesaurus frequency table conversion unit, a frequency table integration unit, and a relevance learning unit. The dictionary generation device having this configuration operates as follows.
The input unit accepts input of a learning document. The word division unit then divides the text in the input document into words. The co-occurrence frequency table generation unit collects frequency statistics for words that appear within a predetermined range in the document. The thesaurus frequency table conversion unit converts thesaurus information for customizing the synonym relevance of the dictionary into a virtual frequency table. The frequency table integration unit integrates the co-occurrence frequency table and the virtual frequency table. The relevance learning unit learns the relevance between words from the co-occurrence frequency table and compresses the table to create a concept dictionary.
With this configuration, converting the thesaurus information for customizing the dictionary's synonym relevance into a virtual frequency table supplements frequency information for words that are absent from the co-occurrence frequency table. Running the relevance learning process on the result brings into the dictionary latent relevance that could not be obtained from a co-occurrence frequency table built on the original word volume alone.
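The table-integration idea described for Patent Document 2 can be illustrated with a minimal sketch. The text here does not say how the thesaurus is converted into virtual frequencies, so everything below is an assumption made for illustration: the function names, the window size, and the choice to give each pair of thesaurus synonyms a fixed pseudo-count are all hypothetical.

```python
from collections import Counter


def cooccurrence_counts(tokens, window=2):
    """Count unordered word pairs that appear within `window` tokens of each other."""
    counts = Counter()
    for i, left in enumerate(tokens):
        for right in tokens[i + 1 : i + 1 + window]:
            if left != right:
                counts[tuple(sorted((left, right)))] += 1
    return counts


def thesaurus_to_virtual_counts(synonym_sets, weight=1):
    """Assumed conversion: every pair inside a synonym set gets `weight` pseudo-counts."""
    counts = Counter()
    for group in synonym_sets:
        members = sorted(set(group))
        for i, left in enumerate(members):
            for right in members[i + 1 :]:
                counts[(left, right)] += weight
    return counts


def integrate(observed, virtual):
    """Merge the observed and virtual frequency tables by adding their counts."""
    return observed + virtual


tokens = ["user", "opens", "account", "customer", "opens", "account"]
observed = cooccurrence_counts(tokens, window=2)
virtual = thesaurus_to_virtual_counts([{"user", "customer"}])
merged = integrate(observed, virtual)
```

In this toy input, "user" and "customer" never fall inside the same window, so only the virtual table contributes a count for that pair. This also shows the limitation the document goes on to raise: the approach depends on having a suitable thesaurus prepared in advance.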

特許文献１：特開２０１０−１５２５６１号公報 (Patent Document 1: JP 2010-152561 A)
特許文献２：特開２００５−２５０７６２号公報 (Patent Document 2: JP 2005-250762 A)

このような先行技術の第一の課題は、情報システム構築に関する提案書や仕様書等というような、特定の案件に関する文書群でのみ成り立つ同義語のある文書の同義語の抽出に、先行技術による同義語の抽出方法を適用すると、同義語の抽出率が低くなってしまうことである。その理由は、情報システム構築に関する提案書や仕様書等というような、特定の案件に関する文書群でのみ成り立つ同義語のある文書の多くは、文章量が限られているため任意の単語に対する共起語として同一の単語が出現する可能性が低く、特許文献１の従来手法で用いられているような共起語の類似性で単語の類似判定を行うことが難しいためである。
先行技術の第二の課題は、情報システム構築に関する提案書や仕様書等というような、特定の案件に関する文書群でのみ成り立つ同義語のある文書の同義語の抽出に、先行技術による同義語の抽出方法を適用すると、特定の案件に関する文書群でのみ成り立つ同義語を抽出することができないことである。その理由は、情報システム構築に関する提案書や仕様書等というような、特定の案件に関する文書群でのみ成り立つ同義語は、事前にその同義関係を把握することが難しく、特許文献２の従来手法で用いられているようなカスタマイズされたシソーラスを準備することが困難であるためである。
すなわち、本発明の目的は、情報システム構築に関する提案書や仕様書等というような、特定の案件に関する文書群でのみ成り立つ同義語のある文書から、特定の案件に関する文書群でのみ成り立つ同義語を抽出する、同義語抽出システム、方法および記録媒体を提供することにある。
The first problem with this prior art is that applying its synonym extraction methods to documents containing synonyms that hold only within a document group for a specific project, such as proposals and specifications for information system construction, results in a low synonym extraction rate. The reason is that many such documents contain a limited amount of text, so the same word is unlikely to appear as a co-occurrence word for any given word, which makes it difficult to judge word similarity from the similarity of co-occurrence words as the conventional method of Patent Document 1 does.
The second problem is that applying the prior-art methods to such documents cannot extract synonyms that hold only within the document group for the specific project. The reason is that such project-specific synonym relations are difficult to identify in advance, so it is difficult to prepare a customized thesaurus of the kind used in the conventional method of Patent Document 2.
Accordingly, an object of the present invention is to provide a synonym extraction system, method, and recording medium that extract, from documents containing synonyms that hold only within a document group for a specific project, such as proposals and specifications for information system construction, those project-specific synonyms.

本発明者らが鋭意検討した結果、分析対象である文書を分析して作成された各単語の共起語ベクトルあるいは概念語ベクトルに、その分析対象文書における単語ペアの出現数あるいは出現頻度、文書量、文書中の単語のバリエーション数の少なくとも一つから直交ベクトルからなる攪乱値を導入して補正コサイン類似度計算を行うことで、上述の課題を解決できることを見出した。
具体的には、本発明の第１の態様による同義語抽出システムは、分析対象である文書の入力を受け付ける文書入力部と；文書中の各文および複合語に形態素解析および構文解析を適用し、各単語の品詞や係り受け関係を抽出する単語分析部と；各単語の共起語ベクトルを、単語間の文脈情報として抽出する共起情報抽出部と；共起語ベクトルに追加すべき攪乱値を、単語ペアの出現数あるいは出現頻度、文書量、文書中の単語のバリエーション数の少なくとも一つから算出する攪乱値算出部と；攪乱値を用いた補正コサイン類似度に基づき、同義関係を持つ単語ペア候補を同義語候補として推定する同義語候補推定部と；推定した同義語候補を提示する同義語候補出力部と；を少なくとも具備する。
また、本発明の第２の態様による同義語抽出システムは、分析対象である文書の入力を受け付ける文書入力部と；文書中の各文および複合語に形態素解析および構文解析を適用し、各単語の品詞や係り受け関係を抽出する単語分析部と；各単語の共起語ベクトルを集約した概念ベクトルを、単語間の文脈情報として抽出する文脈情報抽出部と；概念ベクトルに追加すべき攪乱値を、単語ペアの出現数あるいは出現頻度、文書量、文書中の単語のバリエーション数、共起語の概念集約階層の少なくとも一つから算出する攪乱値算出部と；攪乱値を用いた補正コサイン類似度に基づき、同義関係を持つ単語ペア候補を同義語候補として推定する同義語候補推定部と；推定した同義語候補を提示する同義語候補出力部と；を少なくとも具備する。
As a result of intensive study, the inventors found that the above problems can be solved by introducing a disturbance value, in the form of an orthogonal vector derived from at least one of the number or frequency of occurrences of a word pair in the analyzed document, the document volume, and the number of word variations in the document, into the co-occurrence word vector or concept vector of each word created by analyzing the document, and then computing a corrected cosine similarity.
Specifically, the synonym extraction system according to the first aspect of the present invention comprises at least: a document input unit that accepts input of a document to be analyzed; a word analysis unit that applies morphological analysis and syntactic analysis to each sentence and compound word in the document and extracts the part of speech and dependency relations of each word; a co-occurrence information extraction unit that extracts the co-occurrence word vector of each word as context information between words; a disturbance value calculation unit that calculates a disturbance value to be added to the co-occurrence word vector from at least one of the number or frequency of occurrences of a word pair, the document volume, and the number of word variations in the document; a synonym candidate estimation unit that, based on a corrected cosine similarity using the disturbance value, estimates word pair candidates having a synonymous relationship as synonym candidates; and a synonym candidate output unit that presents the estimated synonym candidates.
The synonym extraction system according to the second aspect of the present invention comprises at least: a document input unit that accepts input of a document to be analyzed; a word analysis unit that applies morphological analysis and syntactic analysis to each sentence and compound word in the document and extracts the part of speech and dependency relations of each word; a context information extraction unit that extracts, as context information between words, a concept vector that aggregates the co-occurrence word vector of each word; a disturbance value calculation unit that calculates a disturbance value to be added to the concept vector from at least one of the number or frequency of occurrences of a word pair, the document volume, the number of word variations in the document, and the concept aggregation level of the co-occurrence words; a synonym candidate estimation unit that, based on a corrected cosine similarity using the disturbance value, estimates word pair candidates having a synonymous relationship as synonym candidates; and a synonym candidate output unit that presents the estimated synonym candidates.
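A minimal sketch of what a corrected cosine similarity with a disturbance value might look like follows. The patent's own formulas are not reproduced in this text, so the form used here is an assumption: each vector is extended with its own extra component of size k, orthogonal to everything in the other vector, which leaves the dot product unchanged and adds k² to each squared norm. The function name `corrected_cosine` is hypothetical.

```python
import math


def corrected_cosine(a, b, k):
    """Cosine similarity after appending an orthogonal component k to each vector.

    Assumed form: a' = (a, k, 0) and b' = (b, 0, k), so the dot product is
    unchanged while each squared norm grows by k**2.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a) + k * k)
    norm_b = math.sqrt(sum(y * y for y in b) + k * k)
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


sparse_pair = corrected_cosine([1, 0, 0], [1, 0, 0], k=1.0)
dense_pair = corrected_cosine([5, 3, 2], [5, 3, 2], k=1.0)
```

Under this assumed form with k = 1, two words whose vectors are both [1, 0, 0] score 0.5, while two words whose vectors are both [5, 3, 2] score 38/39 (about 0.97). A plain cosine similarity would give 1.0 in both cases, so the correction ranks the pair supported by more co-occurrence evidence above the pair that may match only by chance.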

本発明によれば、情報システム構築に関する提案書や仕様書等というような、特定の案件に関する文書群でのみ成り立つ同義語を、その特定の案件に関する文書群から抽出し、それを提示することが可能となる。
According to the present invention, synonyms that hold only within a document group for a specific project, such as proposals and specifications for information system construction, can be extracted from that document group and presented.

図１は本発明の第１の実施形態に係る同義語抽出システムの構成を示すブロック図である。
図２は図１に示した同義語抽出システムの動作例を示すシーケンス図である。
図３は本発明の第２の実施形態に係る同義語抽出システムの構成を示すブロック図である。
図４は図３に示した同義語抽出システムの動作例を示すシーケンス図である。
図５は本発明の第１の実施例に係る同義語抽出システムにおいて使用される、基軸単語と共起語の共起数の関係を示す図である。
図６は本発明の第１の実施例に係る同義語抽出システムにおいて使用される、基軸単語と類似度ランキングの関係を示す図である。
図７は本発明の第２の実施例に係る同義語抽出システムにおいて使用される、基軸単語と共起語の関係を示す図である。
図８は本発明の第２の実施例に係る同義語抽出システムにおいて使用される、基軸単語における概念と共起語の共起数と関係を示す図である。
図９は本発明の第２の実施例に係る同義語抽出システムにおいて使用される、基軸単語と共起語の共起数の関係を示す図である。
図１０は本発明の第２の実施例に係る同義語抽出システムにおいて使用される、基軸単語と類似度ランキングの関係を示す図である。
FIG. 1 is a block diagram showing the configuration of the synonym extraction system according to the first embodiment of the present invention.
FIG. 2 is a sequence diagram showing an operation example of the synonym extraction system shown in FIG. 1.
FIG. 3 is a block diagram showing the configuration of the synonym extraction system according to the second embodiment of the present invention.
FIG. 4 is a sequence diagram showing an operation example of the synonym extraction system shown in FIG. 3.
FIG. 5 is a diagram showing the relationship between base words and the co-occurrence counts of co-occurrence words, used in the synonym extraction system according to the first example of the present invention.
FIG. 6 is a diagram showing the relationship between base words and the similarity ranking, used in the synonym extraction system according to the first example of the present invention.
FIG. 7 is a diagram showing the relationship between base words and co-occurrence words, used in the synonym extraction system according to the second example of the present invention.
FIG. 8 is a diagram showing the relationship between concepts and the co-occurrence counts of co-occurrence words for base words, used in the synonym extraction system according to the second example of the present invention.
FIG. 9 is a diagram showing the relationship between base words and the co-occurrence counts of co-occurrence words, used in the synonym extraction system according to the second example of the present invention.
FIG. 10 is a diagram showing the relationship between base words and the similarity ranking, used in the synonym extraction system according to the second example of the present invention.

次に、本発明に係わる２つの実施の形態について、図面を参照して詳細に説明する。なお、本発明はこれらの実施の形態に限定されるものではない。
［実施形態１］
図１は、本発明の第１の実施形態に係る同義語抽出システム２００の構成を示すブロック図である。
図１を参照すると、本発明の第１の実施形態に係る同義語抽出システム２００は、基本的に電子機器内もしくはサーバと電子機器およびこれらを相互に接続するインターネット等の情報通信ネットワークからなるシステム内に、少なくとも、文書入力部１０、単語分析部２０、共起情報抽出部３０、攪乱値算出部４０、同義語候補推定部５０、同義語候補出力部６０、及び単語データベース１００を含む。
文書入力部１０は、分析対象である文書の入力を受け付ける。単語分析部２０は、文書中の各文および複合語に形態素解析および構文解析を適用し、各単語の品詞や係り受け関係を抽出する。共起情報抽出部３０は、各単語の共起語による共起語ベクトルを、単語間の文脈情報として抽出する。攪乱値算出部４０は、共起語ベクトルに追加すべき単語毎に攪乱値ｋを、単語ペアの出現数あるいは出現頻度と、文書量、文書中の単語のバリエーション数とから算出する。同義語候補推定部５０は、攪乱値を用いた補正コサイン類似度に基づき、同義関係を持つ単語ペア候補を同義語候補として推定する。同義語候補出力部６０は、推定した同義語候補を提示する。
単語データベース１００は、単語の品詞や構文などの情報を収集して蓄積し、単語分析部２０からの特定の単語に関する問い合わせに対し、単語の品詞や構文に関連する情報を検索し応答する、データベースである。なお、単語データベース１００として、インターネット上のデータベースを使用することとしてもよい。
なお、図示の同義語抽出システム２００は、情報システム構築に関する提案書や仕様書等というような、特定の案件に関する文書群でのみ成り立つ同義語のある文書から、同義語を抽出するのに特に有効な同義語抽出システムである。
本第１の実施形態に係る同義語抽出システムを電子デバイスで構成する場合には、同義語抽出システム２００を、プログラム制御により動作するコンピュータで実現することが可能である。コンピュータは、周知のように、データを入力する入力装置と、データ処理装置と、データ処理装置での処理結果を出力する出力装置と、種々のデータベースとして働く補助記憶装置と、を具備するものである。そして、そのデータ処理装置は、プログラムを記憶するリードオンリメモリ（ＲＯＭ）と、データを一時的に記憶するワークエリアとして使用されるランダムアクセスメモリ（ＲＡＭ）と、ＲＯＭに記憶されたプログラムに従いＲＡＭに記憶されているデータを処理する中央処理装置（ＣＰＵ）と、を含み構成される。
この場合、入力装置が文書入力部１０として働く。データ処理装置が、単語分析部２０、共起情報抽出部３０、攪乱値算出部４０、および同義語候補推定部５０として働く。補助記憶装置が単語データベース１００として動作する。出力装置が同義語候補出力部６０として働く。
次に、同義語抽出システム２００を構成する各構成要素を詳細に説明する。
文書入力部１０は、分析対象とする文書もしくは文書群の登録（入力）を受け付ける。
本発明の実施形態における文書あるいは文書群とは、自然言語で記載された特定の文書あるいは文書群を示し、その一つとして、情報システム構築の際に顧客から提出される提案依頼書（ＲＦＰ）やベンダーが顧客に提出する提案書や要件定義書、さらには、基本設計書、機能設計書、テスト仕様書等があるが、これに限定されるものではない。
単語分析部２０は、文書もしくは文書群を構成する各文章に形態素解析や構文解析を適用することで、各文章に使用されている全単語の抽出および単語毎の品詞や格、組み合される助詞、単語間の係り受け関係に関する単語情報の抽出を行う。
ここで、単語は名詞、動詞、形容詞など単独で意味をなす自立語に限定しても良い。上記単語情報には必要に応じて単語間の係り受け関係などを含めても良い。
具体的には、単語分析部２０は、単語データベース１００に単語情報を問い合わせ、文書もしくは文書群を構成する各文章に形態素解析や構文解析を適用することで、各文章に使用されている全単語の抽出および単語毎の品詞などの単語情報の抽出を行うことができる。
共起情報抽出部３０は、各単語の共起語による共起語ベクトルを、単語間の文脈情報として抽出する。共起語ベクトルの抽出としては、具体的には、単語分析部２０で抽出された各文章に使用されている任意の単語を基軸単語として選択し、基軸単語毎の単語情報に基づき、任意の基軸単語共起判定ルールで基軸単語と共起関係とみなされる共起語とその共起数とで表される基軸単語共起語ベクトルを全基軸単語についてまとめた基軸単語共起表を作成する。
ここで、上記基軸単語共起判定ルールとしては、１文、１段落内の全文章、目次上の同一項目内での全文章、文書全体など、文書の特徴に合わせて共起語と見なす範囲を設定して良く、１文内での共起する動詞、および目次上の同一項目内の文章内の名詞のように品詞毎に共起とみなす範囲を変えても良い。さらに、単語情報に単語間の係り受け関係が含まれる場合は、係り受け関係のある単語かどうかを上記基軸単語共起判定ルールとして利用しても良い。また、共起数は共起回数でも良いが、共起回数を基軸単語毎の全共起語数で除した頻度などでも良い。
また、上記基軸単語共起表とは、各行が各基軸単語に、各列が各共起語に対応している行列で、基軸単語に対する共起語の共起数が表の各値として登録されたものである。なお、基軸単語は相互的なもので、先に基軸単語として選択された単語であっても、後に他の単語を基軸単語とみなす場合は共起語として扱うことができる。
攪乱値算出部４０は、共起情報抽出部３０にて抽出した基軸単語共起語ベクトルに直交ベクトルを追加し、共起数が少ない単語ペアに対する補正を行う。この直交ベクトル中のｋの値を攪乱値とする。下記の数式１は、その補正の一例である。

ここで、攪乱値ｋは、単語ペアの出現数や出現頻度、あるいは登録文書の量（文字数）、あるいは登録文書中の単語のバリエーション数の少なくとも一つを利用して単語ペア候補毎あるいは文書全体の単語ペア候補に対して導出される。
攪乱値ｋの導出方法としては、登録文書の量（文字数）もしくは登録文書中の単語のバリエーション数と単調増加の関係にある関数で得られる文書全体の単語ペア候補に関する基礎攪乱値ｋ０と、任意の単語ペア候補の出現数の和もしくは大きい方の出現数と単調増加の関係にある関数で得られるペア攪乱係数αの積などが考えられる。
なお、単語ペアの出現数や出現頻度のように、単語ペア候補毎に攪乱値ｋを設定することで、基軸単語共起語ベクトルが同一あるいは所定の判定基準によって、類似度が同一となった単語ペア候補間においても、その類似性の大小の判定を的確に行うことができる。これは、単語ペアの出現数や出現頻度が少ない場合や、文書量が少ない場合は、共起語の数が少なくなり、偶発的に同じ共起語が用いられた単語ペア間で高い類似度となる可能性があるが、偶発に左右されない共起語の量によって導かれた類似度を優先することができるためである。
同義語候補推定部５０は、攪乱値算出部４０にて算出された攪乱値ｋをコサイン類似度の計算式に導入した補正コサイン類似度（数式２）に基づき、同義関係を持つ単語ペア候補の類似性を判定する。下記の数２中のベクトルａおよびｂは、単語ペア候補の基軸単語共起語ベクトルを示す。なお、攪乱値ｋを導入した基軸単語共起語ベクトルの列数は４以上であることが望ましいがこの限りではない。また、列数の上限は特にない。

同義語候補出力部６０は、同義語候補推定部５０で推定した同義語候補を出力する。
ここで、出力形態は、文書内における同義語候補の組合せを色分けや太字による強調などで明示することで、文書全体を出力する形態などが適当である。他にも、出力形態としては、同義語候補の組合せを抽出した表などの形態であって良い。その際、各抽出条件における単語ペアの類似性を示すランキング表のランキングトップのみを表示する方法や、各抽出条件を総和した結果を表示することも可能である。その他、同義語候補とされた基軸単語を主ノード、その共起語を中間ノード、概念を端ノードとして関係をリンクで結んだグラフを表示し、同義語候補とされた基軸単語を最短で繋ぐリンクを色分けして強調するなどの形態であっても構わない。さらに、同義語候補を抽出する際に用いた非類似度などで同義語間に定量的な同義度を付加し、同義度が任意に設定された閾値より大きい同義語のみに表示を限定しても良いし、同義語候補間の同義度によって色分けや太字による強調もしくはグラフの単語の文字の大きさなどに強弱を与えるなどしても構わない。
また、各出力形態を選択できるように、ベースとなる表示形態から必要に応じて表やグラフに移行できるようにしてもよい。また、必要に応じて動詞や名詞などを選択的に出力するようにしてもよい。
次に、図１及び図２のシーケンス図を参照して、本発明の第１の実施形態に係る同義語抽出システム２００の全体の動作について詳細に説明する。なお、図２に示すシーケンス図及び以下の説明は処理例であり、適宜求める処理に応じて処理順等を入れ替えたり処理を戻したり繰り返したりすることを行ってもよい。
文書入力部１０は、対象とする文書もしくは文書群の入力を受け付ける（ステップＡ１）。
単語分析部２０は、文書もしくは文書群を構成する各文章に形態素解析や構文解析を適用することで、各文章に使用されている全単語の抽出および単語毎の品詞や格、組み合わされる助詞、単語間の係り受け関係に関する単語情報の抽出を行う（ステップＡ２）。
この際、単語データベース１００は、単語の品詞や構文などの情報を収集して蓄積し、特定の単語に関する問い合わせに対し、単語の品詞や構文に関連する情報を検索し応答する（ステップＡ３）。
共起情報抽出部３０は、単語分析部２０で抽出された各文章に使用されている任意の単語を基軸単語として選択し、基軸単語毎の単語情報に基づき、所定の基軸単語共起判定ルールで基軸単語と共起関係とみなされる共起語とその共起数とで表される基軸単語共起語ベクトルを、全基軸単語についてまとめた基軸単語共起表を作成する（ステップＡ３）。
攪乱値算出部４０は、共起情報抽出部３０にて作成した基軸単語共起表と、単語ペアの出現数や出現頻度、あるいは登録文書の量（文字数）、あるいは登録文書中の単語のバリエーション数の少なくとも一つを利用して、単語ペア候補毎あるいは文書全体に攪乱値ｋを定義する（ステップＡ４）。
同義語候補推定部５０は、各基軸単語に対応する基軸単語共起語ベクトル間の類似性を所定の判定基準によって判定し、基軸単語共起語ベクトルの意味的な類似性が高く、同義語の可能性が想定される基軸単語の組合せを同義語候補として順次抽出（推定）する（ステップＡ５）。
同義語候補出力部６０は、同義語候補推定部５０で抽出（推定）した同義語候補を出力する（ステップＡ６）。
次に、本発明の第１の実施形態に係る同義語抽出システム２００の効果について説明する。
本第１の実施形態では、文書内もしくは文書群内の基軸単語共起語ベクトルに、単語ペアの出現数や出現頻度、あるいは登録文書の量（文字数）、あるいは登録文書中の単語のバリエーション数から算出された攪乱値ｋを加えることで、各単語の出現回数が少なく基軸単語共起語ベクトルが疎行列で類似の判定が困難な文章量の少ない条件でも類似性の評価が可能になり、情報システム構築に関する提案書や仕様書等というような、特定の案件に関する文書から意義は同じで語形が異なっている同義語を抽出できる。すなわち、先行技術の第一の課題を解決することができる。
尚、上記本発明の第１の実施形態に係る文書中の同義語抽出システム２００は、同義語抽出方法として実現され得る。また、上記本発明の第１の実施形態に係る文書中の同義語抽出システム２００は、同義語抽出プログラムによりコンピュータによって実行させるようにしても良い。
［実施形態２］
次に、本発明の第２の実施形態について、図面を参照して詳細に説明する。
図３は、本発明の第２の実施形態に係る文書中の同義語抽出システム２１０の構成を示すブロック図である。
図３を参照すると、本発明の第２の実施形態に係る同義語抽出システム２１０は、基本的に電子機器内もしくはサーバと電子機器およびこれらを相互に接続するインターネット等の情報通信ネットワークからなるシステム内に、少なくとも、文書入力部１０、単語分析部２０、概念情報抽出部３１、攪乱値算出部４０、同義語候補推定部５０、同義語候補出力部６０、単語データベース１００、及び概念データベース１１０を含む。
文書入力部１０は、分析対象である文書の入力を受け付ける。単語分析部２０は、文書中の各文および複合語に形態素解析および構文解析を適用し、各単語の品詞や係り受け関係を抽出する。概念情報抽出部３１は、各単語の概念ベクトルを、単語間の文脈情報として抽出する。攪乱値算出部４０は、概念ベクトルに追加すべき単語毎の攪乱値ｋを、単語ペアの出現数あるいは出現頻度と、文書量、文書中の単語のバリエーション数、共起語の概念集約階層とから算出する。同義語候補推定部５０は、攪乱値を用いた補正コサイン類似度に基づき、同義関係を持つ単語ペア候補を同義語候補として推定する。同義語候補出力部６０は、推定した同義語候補を提示する。
単語データベース１００は、単語の品詞や構文などの情報を収集して蓄積し、単語分析部２０からの特定の単語に関する問い合わせに対し、単語の品詞や構文に関連する情報を検索し応答する、データベースである。
概念データベース１１０は、単語の概念分類、同義語、類義語、用法といった単語の一般概念を体系付けた一般概念情報を収集して蓄積し、概念情報抽出部３１からの特定の単語に関する問い合わせに対し、単語の意味や用法に関連する一般概念情報を検索し応答する、データベースである。
図示の同義語抽出システム２１０を上述したコンピュータで実現した場合、入力装置が文書入力部１０として働き、データ処理装置が、単語分析部２０、概念情報抽出部３１、攪乱値算出部４０、および同義語候補推定部５０として働き、補助記憶装置が単語データベース１００および概念データベース１１０として動作し、出力装置が同義語候補出力部６０として働く。
次に、同義語抽出システム２１０を構成する各構成要素を詳細に説明する。
なお、本第２の実施形態に係る同義語抽出システム２１０において、文書入力部１０と、単語分析部２０と、同義関係を持つ単語ペア候補を抽出する同義語候補推定部５０と、同義語候補出力部６０と、単語データベース１００とは、上述した第１の実施形態に係る同義語抽出システム２００のそれら（対応物）と同等である。したがって、以下では、重複した説明を省くために、相違点についてのみ詳細に説明する。
概念データベース１１０は、収集された単語の概念分類および一般的な同義語、類義語、用法などの一般概念情報を収集して蓄積し、特定の単語に関する問い合わせに対し、単語の意味や用法に関連する一般概念情報を検索し応答するデータベースである。概念データベース１１０は、単語の上位／下位関係、部分／全体関係、同義関係、類義関係などによって単語を分類し、体系づけたシソーラスなどが相当する。なお、概念データベース１１０として、インターネット上のデータベースを使用することとしてもよい。
概念情報抽出部３１は、各単語の共起語ベクトルを集約した概念ベクトルを、単語間の文脈情報として抽出する。
共起語ベクトルの抽出としては、上記第１の実施形態で記載した方法を利用することができる。概念ベクトルの抽出としては、具体的には、基軸単語共起表の基軸単語共起語ベクトルの各共起語のそれぞれについて、概念データベース１１０に一般概念情報を問い合わせ、任意の範囲内で基軸単語共起表における各基軸単語共起語ベクトルの各共起語を概念に変換した基軸単語概念ベクトルを、全基軸単語についてまとめた基軸単語概念表を作成することができる。
概念への変換で異なる共起語が同じ概念となる場合、概念情報抽出部３１は、それぞれの共起語を合流し、共起数の和を対応箇所へ登録する。
また、概念データベース１１０として大分類、中分類、小分類のような複数の階層での概念が一般概念情報として登録されたシソーラスを用いる場合、概念情報抽出部３１は、階層毎に概念表を作成し、大分類など広い概念での基軸単語概念表で異なる共起語が同じ概念となる場合は、それぞれの共起語を合流し、共起数の和を対応箇所へ登録する。他に、概念データベース１１０として同義語を含む類義語群が一般概念情報として登録された類語辞書を用いた場合、概念情報抽出部３１は、共起語を対応する類義語群の各類義語に変換し、各類義語の共起数として対応する共起語の共起数を割り当て、同一の基軸単語の共起語に関して変換された類義語毎の共起数の延べ数を基軸単語概念ベクトルとして算出しても良い。
なお、概念データベース１１０に共起語に対応する概念が無い場合、概念情報抽出部３１は、上記共起語を概念に変換せず、共起語の単語をそのまま概念として扱い残す。
攪乱値算出部４０は、概念情報抽出部３１にて抽出した基軸単語概念ベクトルに、直交ベクトルを追加し、共起数が少ない単語ペアに対する補正を行う。
基本的な操作は、共起語ベクトルを用いた上記第１の実施形態と同等である。攪乱値ｋは、単語ペアの出現数や出現頻度、あるいは登録文書の量（文字数）、あるいは登録文書中の単語のバリエーション数、共起語の概念集約階層の少なくとも一つを利用して、単語ペア候補毎あるいは文書全体の単語ペア候補に攪乱値ｋを定義することができる。
次に、図３及び図４のシーケンス図を参照して、本発明の第２の実施形態に係る同義語抽出システム２１０の全体の動作について詳細に説明する。なお、図４に示すシーケンス図および以下の説明は処理例であり、第１の実施形態と同様に処理順を入れ替えたり処理を戻したりすることを行ってもよい。
文書入力部１０は、対象とする文書もしくは文書群の入力を受け付ける（ステップＢ１）。
単語分析部２０は、文書もしくは文書群を構成する各文章に形態素解析や構文解析を適用することで、各文章に使用されている全単語の抽出および単語毎の品詞や格、組み合される助詞、単語間の係り受け関係に関する単語情報の抽出を行う（ステップＢ２）。
単語データベース１００は、単語の品詞や構文などの情報を収集して蓄積し、特定の単語に関する問い合わせに対し、単語の品詞や構文に関連する情報を検索し応答する（ステップＢ３）。
概念情報抽出部３１は、単語分析部２０で抽出された各文章に使用されている任意の単語を基軸単語として選択し、基軸単語毎の単語情報に基づき、任意の基軸単語共起判定ルールで基軸単語と共起関係とみなされる共起語とその共起数とで表される基軸単語共起語ベクトルの各共起語を概念に変換した基軸単語概念ベクトルを、全基軸単語についてまとめた基軸単語概念表を作成する（ステップＢ４）。
概念データベース１１０は、単語の概念分類、同義語、類義語、用法といった単語の一般概念を体系付けた一般概念情報を収集して蓄積し、特定の単語に関する問い合わせに対し、単語の意味や用法に関連する一般概念情報を検索し応答する（ステップＢ５）。
攪乱値算出部４０は、概念情報抽出部３１にて作成した基軸単語概念表と、単語ペアの出現数や出現頻度、登録文書の量（文字数）、登録文書中の単語のバリエーション数、および共起語の概念集約階層の少なくとも一つを利用して、単語ペア候補毎あるいは文書全体に攪乱値ｋを定義する（ステップＢ６）。
同義語候補推定部５０は、各基軸単語に対応する基軸単語概念ベクトル間の類似性を所定の判定基準によって判定し、基軸単語概念ベクトルの意味的な類似性が高く、同義語の可能性が想定される基軸単語の組合せを同義語候補として抽出（推定）する（ステップＢ７）。
同義語候補出力部６０は、同義語候補推定部５０で抽出（推定）した同義語候補を出力する（ステップＢ８）。
次に、本発明の第２の実施形態に係る同義語抽出システム２１０の効果について説明する。
第２の実施形態では、文書内もしくは文書群内の基軸単語概念ベクトルに、単語ペアの出現数や出現頻度、あるいは登録文書の量（文字数）、あるいは登録文書中の単語のバリエーション数、あるいは共起語の概念集約階層から算出された攪乱値ｋを加えることで、各単語の出現回数が少なく基軸単語概念ベクトルが疎行列で類似の判定が困難な文章量の少ない条件でも、より的確に類似性の評価が可能になり、情報システム構築に関する提案書や仕様書等というような、特定の案件に関する文書から意義は同じで語形が異なっている同義語を抽出できる。すなわち、先行技術の第二の課題を解決することができる。
尚、上記本発明の実施形態に係る文書中の同義語抽出システム２１０は、同義語抽出方法として実現され得るが、同義語抽出プログラムによりコンピュータによって実行させるようにしても良い。
Next, two embodiments of the present invention will be described in detail with reference to the drawings. The present invention is not limited to these embodiments.
[Embodiment 1]
FIG. 1 is a block diagram showing the configuration of a synonym extraction system 200 according to the first embodiment of the present invention.
Referring to FIG. 1, the synonym extraction system 200 according to the first embodiment includes at least a document input unit 10, a word analysis unit 20, a co-occurrence information extraction unit 30, a disturbance value calculation unit 40, a synonym candidate estimation unit 50, a synonym candidate output unit 60, and a word database 100. These are basically provided within a single electronic device, or within a system made up of a server, electronic devices, and an information communication network such as the Internet that interconnects them.
The document input unit 10 accepts input of a document to be analyzed. The word analysis unit 20 applies morphological analysis and syntactic analysis to each sentence and compound word in the document and extracts the part of speech and dependency relations of each word. The co-occurrence information extraction unit 30 extracts, for each word, a co-occurrence word vector built from its co-occurrence words as context information between words. The disturbance value calculation unit 40 calculates, for each word, a disturbance value k to be added to the co-occurrence word vector from the number or frequency of occurrences of word pairs, the document volume, and the number of word variations in the document. The synonym candidate estimation unit 50 estimates word pair candidates having a synonymous relationship as synonym candidates, based on the corrected cosine similarity using the disturbance value. The synonym candidate output unit 60 presents the estimated synonym candidates.
The word database 100 is a database that collects and accumulates information such as the part of speech and syntax of words and, in response to an inquiry about a specific word from the word analysis unit 20, retrieves and returns information related to that word's part of speech and syntax. Note that a database on the Internet may be used as the word database 100.
The illustrated synonym extraction system 200 is particularly effective for extracting synonyms from documents containing synonyms that hold only within a document group related to a specific project, such as proposals and specifications for information system construction.
When the synonym extraction system according to the first embodiment is configured as an electronic device, the synonym extraction system 200 can be realized by a computer that operates under program control. As is well known, such a computer includes an input device for inputting data, a data processing device, an output device for outputting the processing results of the data processing device, and an auxiliary storage device that functions as various databases. The data processing device includes a read only memory (ROM) that stores a program, a random access memory (RAM) used as a work area for temporarily storing data, and a central processing unit (CPU) that processes the data stored in the RAM according to the program stored in the ROM.
In this case, the input device functions as the document input unit 10. The data processing device functions as the word analysis unit 20, the co-occurrence information extraction unit 30, the disturbance value calculation unit 40, and the synonym candidate estimation unit 50. The auxiliary storage device operates as the word database 100. The output device functions as the synonym candidate output unit 60.
Next, each component of the synonym extraction system 200 is described in detail.
The document input unit 10 receives registration (input) of a document or document group to be analyzed.
The document or document group in the embodiments of the present invention refers to a specific document or document group written in natural language. Examples include a request for proposal (RFP) that a customer issues when an information system is to be built, and the proposals, requirement definitions, basic designs, functional designs, and test specifications that a vendor submits to the customer, but the documents are not limited to these.
The word analysis unit 20 applies morphological analysis and syntactic analysis to each sentence constituting the document or document group, thereby extracting all words used in each sentence together with word information for each word, such as its part of speech, case, attached particles, and dependency relations with other words.
Here, the words may be limited to independent words (content words) such as nouns, verbs, and adjectives. The word information may include dependency relations between words as necessary.
Specifically, the word analysis unit 20 queries the word database 100 for word information and applies morphological analysis and syntactic analysis to each sentence constituting the document or document group, so that it can extract all words used in each sentence and word information such as the part of speech of each word.
The co-occurrence information extraction unit 30 extracts, as context information between words, a co-occurrence word vector based on the co-occurrence words of each word. Specifically, it selects each word used in the sentences extracted by the word analysis unit 20 as a base word. Based on the word information for each base word, it forms a base-word co-occurrence word vector consisting of the co-occurrence words that an arbitrary base-word co-occurrence determination rule regards as co-occurring with the base word, together with their co-occurrence counts, and compiles these vectors for all base words into a base-word co-occurrence table.
Here, the base-word co-occurrence determination rule sets the range within which words are regarded as co-occurring, such as one sentence, all sentences in one paragraph, all sentences under the same table-of-contents item, or the whole document, according to the characteristics of the document. The range may also differ by part of speech, for example verbs that co-occur within one sentence and nouns that co-occur within the sentences under the same table-of-contents item. Further, when the word information includes dependency relations between words, the presence of a dependency relation may be used as the rule. The co-occurrence value may be the raw co-occurrence count, or a frequency obtained by dividing the count by the total co-occurrence count of all co-occurrence words for that base word.
The base-word co-occurrence table is a matrix in which each row corresponds to a base word and each column to a co-occurrence word; each cell holds the number of times the column's co-occurrence word co-occurs with the row's base word. Note that base word and co-occurrence word are relative roles: a word that was selected as a base word is treated as a co-occurrence word when another word is taken as the base word.
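The table construction described above can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the morphological analysis has already produced `(word, part_of_speech)` pairs, uses one sentence as the co-occurrence range, and takes nouns as base words; the function names and the part-of-speech labels are hypothetical.

```python
from collections import Counter, defaultdict


def build_cooccurrence_table(sentences,
                             base_pos=("noun",),
                             co_pos=("noun", "verb", "adjective")):
    """Count, for every base word, the words co-occurring in the same sentence.

    sentences: list of sentences, each a list of (word, part_of_speech) pairs.
    The POS labels and the one-sentence window are assumptions of this sketch.
    """
    table = defaultdict(Counter)
    for sentence in sentences:
        for i, (base, pos) in enumerate(sentence):
            if pos not in base_pos:
                continue  # only words of the chosen POS act as base words
            for j, (other, other_pos) in enumerate(sentence):
                if i != j and other != base and other_pos in co_pos:
                    table[base][other] += 1
    return table


def to_matrix(table):
    """Lay the counts out as rows (base words) by columns (co-occurrence words)."""
    rows = sorted(table)
    columns = sorted({word for counts in table.values() for word in counts})
    matrix = [[table[row][col] for col in columns] for row in rows]
    return rows, columns, matrix
```

Because every noun is both a row and a potential column, the same word appears as a base word in one row and as a co-occurrence word in another, which mirrors the relative roles noted above.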
The disturbance value calculation unit 40 adds an orthogonal vector to the base-word co-occurrence word vector extracted by the co-occurrence information extraction unit 30, in order to correct for word pairs with few co-occurrences. The value k in this orthogonal vector is the disturbance value. Equation 1 below is an example of the correction.

Here, the disturbance value k is derived for each word pair candidate, or for the word pair candidates of the entire document, using at least one of the number of occurrences or occurrence frequency of the word pair, the amount of registered documents (number of characters), and the number of word variations in the registered documents.
One conceivable way to derive the disturbance value k is to take the product of a basic disturbance value k0 and a pair disturbance coefficient α. The basic disturbance value k0 applies to the word pair candidates of the entire document and is obtained from a function that increases monotonically with the amount of registered documents (number of characters) or with the number of word variations in the registered documents. The pair disturbance coefficient α is obtained from a function that increases monotonically with the sum of the occurrence counts of the words in a given word pair candidate, or with the larger of the two counts.
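As an illustration of this product form, the sketch below uses simple linear functions for k0 and α. The text only requires that the functions be monotonically increasing, so the linear shapes, the function names, and the default scale factors are all assumptions made for this example.

```python
def base_disturbance(num_chars, scale=1e-5):
    """Hypothetical k0: grows monotonically with the document size in characters."""
    return 1.0 + scale * num_chars


def pair_coefficient(count_a, count_b, scale=0.1, use_larger=False):
    """Hypothetical alpha: grows with the pair's occurrence counts.

    Uses the sum of the two counts by default, or the larger count
    when use_larger is True.
    """
    n = max(count_a, count_b) if use_larger else count_a + count_b
    return scale * n


def disturbance_value(num_chars, count_a, count_b, use_larger=False):
    """k = k0 * alpha for one word pair candidate."""
    return base_disturbance(num_chars) * pair_coefficient(
        count_a, count_b, use_larger=use_larger
    )
```

With these assumed functions, a larger document or a more frequent word pair always yields a larger k, which is the only property the description relies on.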
In addition, by setting the disturbance value k for each word pair candidate from, for example, the number of occurrences or occurrence frequency of the word pair, the similarity between word pair candidates can be ranked accurately even when their base-word co-occurrence word vectors are identical or their similarity under a predetermined criterion is equal. The reason is as follows. When word pairs occur rarely or the amount of documents is small, the number of co-occurrence words is small, and word pairs that happen to share the same co-occurrence words by chance obtain a high similarity. The disturbance value makes it possible to give priority to similarities that are supported by enough co-occurrence words not to be affected by chance.
The synonym candidate estimation unit 50 determines the similarity of word pair candidates that may be synonymous based on the corrected cosine similarity (Equation 2), which is obtained by introducing the disturbance value k calculated by the disturbance value calculation unit 40 into the cosine similarity formula. The vectors a and b in Equation 2 below are the base-word co-occurrence word vectors of a word pair candidate. The number of columns of the base-word co-occurrence word vector into which the disturbance value k is introduced is preferably 4 or more, but is not limited thereto. There is no particular upper limit on the number of columns.
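Since the equations themselves are not reproduced in this text, the following sketch shows only one possible reading of the correction, under a stated assumption: each vector is extended with its own extra component of value k, a' = (a1, ..., an, k, 0) and b' = (b1, ..., bn, 0, k), so the added parts are orthogonal to each other and to the original components. The dot product is then unchanged while each squared norm grows by k². The function name is hypothetical.

```python
import math


def corrected_cosine(a, b, k=0.0):
    """Cosine similarity of a and b after appending mutually orthogonal k terms.

    Assumed form: cos = (a . b) / (sqrt(|a|^2 + k^2) * sqrt(|b|^2 + k^2)).
    With k = 0 this is the ordinary cosine similarity.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of columns")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a) + k * k)
    norm_b = math.sqrt(sum(y * y for y in b) + k * k)
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # no co-occurrence information at all
    return dot / (norm_a * norm_b)
```

Under this reading, two identical vectors with a single co-occurrence, such as (1, 0, 0, 0), score 0.5 with k = 1, while two identical vectors with ten co-occurrences, such as (10, 0, 0, 0), score about 0.99. Pairs that agree on only a few co-occurrences are therefore ranked below pairs that agree on many.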

The synonym candidate output unit 60 outputs the synonym candidates estimated by the synonym candidate estimation unit 50.
Here, a suitable output form is to output the entire document with the combinations of synonym candidates marked by color coding or bold emphasis. Alternatively, the output may be a table listing the extracted synonym candidate combinations. In that case, only the top entries of the ranking table showing word pair similarity under each extraction condition may be displayed, or the results aggregated over the extraction conditions may be displayed. The output may also be a graph in which base words that are synonym candidates are main nodes, their co-occurrence words are intermediate nodes, and concepts are end nodes, with the links that connect synonym-candidate base words by the shortest path highlighted by color. Furthermore, a quantitative degree of synonymy, such as the dissimilarity used when extracting the synonym candidates, may be attached to each candidate pair. The display may then be limited to candidates whose degree of synonymy exceeds an arbitrarily set threshold, or the color coding, bold emphasis, or character size of words in the graph may vary with the degree of synonymy.
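The ranking-table form of the output, limited to the top entries or to pairs above a threshold, can be sketched as below. The function name and parameters are hypothetical, and the similarity scores are assumed to have been computed beforehand.

```python
def rank_candidates(similarities, top_n=None, threshold=None):
    """Sort word pairs by similarity and optionally filter them.

    similarities: dict mapping (word_a, word_b) to a similarity score.
    threshold: keep only pairs whose score is strictly greater than this value.
    top_n: keep only the first top_n pairs after filtering.
    """
    ranked = sorted(similarities.items(), key=lambda item: item[1], reverse=True)
    if threshold is not None:
        ranked = [(pair, score) for pair, score in ranked if score > threshold]
    if top_n is not None:
        ranked = ranked[:top_n]
    return ranked
```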
Further, the system may allow switching from the basic display form to a table or a graph as needed, so that each output form can be selected. It may also selectively output only verbs, only nouns, and so on, as needed.
Next, the overall operation of the synonym extraction system 200 according to the first embodiment of the present invention will be described in detail with reference to FIG. 1 and the sequence diagram of FIG. 2. Note that the sequence diagram shown in FIG. 2 and the following description are only an example of the processing; the processing order may be changed, and processing may return to an earlier step or be repeated, as appropriate.
The document input unit 10 receives an input of a target document or document group (step A1).
The word analysis unit 20 applies morphological analysis and syntactic analysis to each sentence constituting the document or document group, thereby extracting all words used in each sentence together with word information for each word, such as its part of speech, case, attached particles, and dependency relations with other words (step A2).
At this time, the word database 100 collects and stores information such as the part of speech and syntax of the word, and searches and responds to information related to the part of speech and syntax of the word in response to an inquiry regarding a specific word (step A3).
The co-occurrence information extraction unit 30 selects each word used in the sentences extracted by the word analysis unit 20 as a base word and, based on the word information for each base word, creates a base-word co-occurrence table. The table compiles, for all base words, the base-word co-occurrence word vectors consisting of the co-occurrence words that a predetermined base-word co-occurrence determination rule regards as co-occurring with the base word and their co-occurrence counts (step A3).
The disturbance value calculation unit 40 uses the base-word co-occurrence table created by the co-occurrence information extraction unit 30, together with at least one of the number of occurrences or occurrence frequency of word pairs, the amount of registered documents (number of characters), and the number of word variations in the registered documents, to define a disturbance value k for each word pair candidate or for the entire document (step A4).
The synonym candidate estimation unit 50 determines the similarity between the base-word co-occurrence word vectors of the base words according to a predetermined criterion, and sequentially extracts (estimates), as synonym candidates, combinations of base words whose vectors have high semantic similarity and which may therefore be synonyms (step A5).
The synonym candidate output unit 60 outputs the synonym candidates extracted (estimated) by the synonym candidate estimation unit 50 (step A6).
Next, the effect of the synonym extraction system 200 according to the first embodiment of the present invention will be described.
In the first embodiment, a disturbance value k calculated from the number of occurrences or occurrence frequency of word pairs, the amount of registered documents (number of characters), or the number of word variations in the registered documents is added to the base-word co-occurrence word vectors of a document or document group. This makes it possible to evaluate similarity more accurately even when the amount of text is small, each word occurs only a few times, and the base-word co-occurrence word vectors are therefore sparse and hard to compare. As a result, synonyms that have the same meaning but different word forms can be extracted from documents related to a specific project, such as proposals and specifications for information system construction. That is, the first problem of the prior art can be solved.
The synonym extraction system 200 in the document according to the first embodiment of the present invention can be realized as a synonym extraction method. The synonym extraction system 200 in the document according to the first embodiment of the present invention may be executed by a computer using a synonym extraction program.
[Embodiment 2]
Next, a second embodiment of the present invention will be described in detail with reference to the drawings.
FIG. 3 is a block diagram showing a configuration of a synonym extraction system 210 in a document according to the second embodiment of the present invention.
Referring to FIG. 3, the synonym extraction system 210 according to the second embodiment of the present invention is basically either an electronic device, or a system consisting of a server, an electronic device, and an information communication network such as the Internet that interconnects them. It includes the document input unit 10, the word analysis unit 20, the concept information extraction unit 31, the disturbance value calculation unit 40, the synonym candidate estimation unit 50, the synonym candidate output unit 60, the word database 100, and the concept database 110.
The document input unit 10 receives input of a document to be analyzed. The word analysis unit 20 applies morphological analysis and syntactic analysis to each sentence and compound word in the document, and extracts the part of speech and dependency relations of each word. The concept information extraction unit 31 extracts the concept vector of each word as context information between words. The disturbance value calculation unit 40 calculates the disturbance value k for each word, to be added to the concept vector, from the number of occurrences or frequency of word pairs, the amount of documents, the number of variations of words in the document, and the concept aggregation hierarchy of co-occurrence words. The synonym candidate estimation unit 50 estimates a word pair candidate having a synonym relationship as a synonym candidate based on the corrected cosine similarity using the disturbance value. The synonym candidate output unit 60 presents the estimated synonym candidates.
The word database 100 is a database that collects and accumulates information such as the part of speech and syntax of words and, in response to an inquiry about a specific word from the word analysis unit 20, retrieves and returns information related to that word's part of speech and syntax.
The concept database 110 is a database that collects and accumulates general concept information, which systematizes general concepts of words such as concept classifications, synonyms, near-synonyms, and usages, and that, in response to an inquiry about a specific word from the concept information extraction unit 31, retrieves and returns general concept information related to the word's meaning and usage.
When the illustrated synonym extraction system 210 is realized by the above-described computer, the input device functions as the document input unit 10, the data processing device functions as the word analysis unit 20, the concept information extraction unit 31, the disturbance value calculation unit 40, and the synonym candidate estimation unit 50, the auxiliary storage device operates as the word database 100 and the concept database 110, and the output device functions as the synonym candidate output unit 60.
Next, each component of the synonym extraction system 210 is described in detail.
In the synonym extraction system 210 according to the second embodiment, the document input unit 10, the word analysis unit 20, the synonym candidate estimation unit 50 (which extracts word pair candidates having a synonym relationship), the synonym candidate output unit 60, and the word database 100 are equivalent to the corresponding components of the synonym extraction system 200 according to the first embodiment described above. To avoid redundant description, only the differences are described in detail below.
The concept database 110 is a database that collects and accumulates general concept information, such as concept classifications of words and their general synonyms, near-synonyms, and usages, and that retrieves and returns general concept information related to a word's meaning and usage in response to an inquiry about that word. The concept database 110 corresponds to, for example, a thesaurus in which words are classified and organized by hypernym/hyponym relations, part/whole relations, synonym relations, near-synonym relations, and the like. Note that a database on the Internet may be used as the concept database 110.
The concept information extraction unit 31 extracts, as context information between words, a concept vector obtained by aggregating the co-occurrence word vector of each word into concepts.
The co-occurrence word vectors can be extracted by the method described in the first embodiment. To extract the concept vectors, specifically, the unit queries the concept database 110 for the general concept information of each co-occurrence word in the base-word co-occurrence word vectors of the base-word co-occurrence table. It then converts each co-occurrence word of each base-word co-occurrence word vector into a concept within an arbitrary range, and compiles the resulting base-word concept vectors for all base words into a base-word concept table.
When different co-occurrence words become the same concept in the conversion to the concept, the concept information extraction unit 31 merges the respective co-occurrence words and registers the sum of the co-occurrence numbers in the corresponding locations.
Further, when a thesaurus in which concepts at multiple levels, such as major, middle, and minor classifications, are registered as general concept information is used as the concept database 110, the concept information extraction unit 31 creates a base-word concept table for each level. When different co-occurrence words fall under the same concept in the table for a broad level such as the major classification, those co-occurrence words are merged and the sum of their co-occurrence counts is registered in the corresponding cell. In addition, when a near-synonym dictionary in which near-synonyms (including synonyms) are registered as general concept information is used as the concept database 110, the concept information extraction unit 31 may convert each co-occurrence word into each of its registered near-synonyms, assign the co-occurrence count of that co-occurrence word to each near-synonym, and use the per-near-synonym totals over the co-occurrence words of the same base word as the base-word concept vector.
If there is no concept corresponding to the co-occurrence word in the concept database 110, the concept information extraction unit 31 does not convert the co-occurrence word into a concept and leaves the word of the co-occurrence word as a concept as it is.
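The conversion of a co-occurrence word vector into a concept vector, including the merge of counts and the fallback for words without a registered concept, can be sketched as follows. The function name and the plain dictionary standing in for the concept database 110 are assumptions of this example.

```python
from collections import Counter


def to_concept_vector(cooccurrence_counts, concept_of):
    """Aggregate co-occurrence counts by concept.

    cooccurrence_counts: dict mapping co-occurrence word -> count for one base word.
    concept_of: dict mapping word -> concept at one classification level
                (a stand-in for a lookup in the concept database).
    Words with no registered concept are kept unchanged as their own concept.
    """
    concept_counts = Counter()
    for word, count in cooccurrence_counts.items():
        concept = concept_of.get(word, word)  # fallback: the word itself
        concept_counts[concept] += count      # merged words sum their counts
    return concept_counts
```

For a multi-level thesaurus, the same function would be called once per level with that level's mapping, giving one concept table per level.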
The disturbance value calculation unit 40 adds an orthogonal vector to the base-word concept vector extracted by the concept information extraction unit 31, in order to correct for word pairs with few co-occurrences.
The basic operation is the same as in the first embodiment, which uses the co-occurrence word vector. The disturbance value k can be defined for each word pair candidate, or for the word pair candidates of the entire document, using at least one of the number of occurrences or occurrence frequency of word pairs, the amount of registered documents (number of characters), the number of word variations in the registered documents, and the concept aggregation hierarchy of co-occurrence words.
Next, the overall operation of the synonym extraction system 210 according to the second embodiment of the present invention will be described in detail with reference to FIG. 3 and the sequence diagram of FIG. 4. As in the first embodiment, the sequence diagram shown in FIG. 4 and the following description are only an example of the processing; the processing order may be changed, and processing may return to an earlier step.
The document input unit 10 receives an input of a target document or document group (step B1).
The word analysis unit 20 applies morphological analysis and syntactic analysis to each sentence constituting the document or document group, thereby extracting all words used in each sentence together with word information for each word, such as its part of speech, case, attached particles, and dependency relations with other words (step B2).
The word database 100 collects and accumulates information such as part of speech and syntax of a word, and searches and responds to information related to the part of speech and syntax of a word in response to an inquiry about a specific word (step B3).
The concept information extraction unit 31 selects each word used in the sentences extracted by the word analysis unit 20 as a base word. Based on the word information for each base word, it forms the base-word co-occurrence word vector, consisting of the co-occurrence words that an arbitrary base-word co-occurrence determination rule regards as co-occurring with the base word and their co-occurrence counts. It then converts each co-occurrence word of that vector into a concept to obtain a base-word concept vector, and compiles these vectors for all base words into a base-word concept table (step B4).
The concept database 110 collects and accumulates general concept information, which systematizes general concepts of words such as concept classifications, synonyms, near-synonyms, and usages, and, in response to an inquiry about a specific word, retrieves and returns general concept information related to the word's meaning and usage (step B5).
The disturbance value calculation unit 40 uses the base-word concept table created by the concept information extraction unit 31, together with at least one of the number of occurrences or occurrence frequency of word pairs, the amount of registered documents (number of characters), the number of word variations in the registered documents, and the concept aggregation hierarchy of co-occurrence words, to define a disturbance value k for each word pair candidate or for the entire document (step B6).
The synonym candidate estimation unit 50 determines the similarity between the base-word concept vectors of the base words according to a predetermined criterion, and extracts (estimates), as synonym candidates, combinations of base words whose concept vectors have high semantic similarity and which may therefore be synonyms (step B7).
The synonym candidate output unit 60 outputs the synonym candidates extracted (estimated) by the synonym candidate estimation unit 50 (step B8).
Next, the effect of the synonym extraction system 210 according to the second embodiment of the present invention will be described.
In the second embodiment, a disturbance value k calculated from the number of occurrences or occurrence frequency of word pairs, the amount of registered documents (number of characters), the number of word variations in the registered documents, or the concept aggregation hierarchy of co-occurrence words is added to the base-word concept vectors of a document or document group. This makes it possible to evaluate similarity more accurately even when the amount of text is small, each word occurs only a few times, and the base-word concept vectors are therefore sparse and hard to compare. As a result, synonyms that have the same meaning but different word forms can be extracted from documents related to a specific project, such as proposals and specifications for information system construction. That is, the second problem of the prior art can be solved.
The synonym extraction system 210 in the document according to the embodiment of the present invention can be realized as a synonym extraction method, but may be executed by a computer using a synonym extraction program.

次に、具体的な第１の実施例を用いて、本発明の第１の実施形態に係る同義語抽出システム２００について具体的に説明する。
同義語抽出システム２００は、文書入力部１０から、分析実施者が特定の案件に関する文書群でのみ成り立つ同義語候補を推定したい文書群を構成する文書の入力を受け付ける。
そして、同義語抽出システム２００は、文書に形態素解析および構文解析を適用し、文書を構成する単語に分解し、単語毎の抽出元の文および品詞を解析することで、名詞、動詞、形容詞、および形容動詞を単語として抽出する。なお、動詞の内でサ行変格活用に属する動詞は、活用部分を除去しいわゆるサ変名詞化した形態で抽出する。
さらに、同義語抽出システム２００は、文書に含まれる単語の内で名詞を単語Ｓとし、各単語ｉ（ｉ＝１、２、・・・、ｎ）について、特定の単語Ｓｉと同一文中で共起関係にある名詞、動詞、形容詞を、共起語Ｖｊ（ｊ＝１、２、・・・、ｍ）として抽出し、単語Ｓｉに対する各共起語Ｖｉｊの共起回数を共起数Ｎｉｊとして集計し、全ての単語Ｓｉに対する各共起語Ｖｉについて表形式にまとめた単語共起表Ｅを作成する。なお、単語共起表Ｅの単語Ｓｉに対する各共起語Ｖｊの共起数Ｎｉｊをまとめたデータセットを単語共起語ベクトルＮｉと呼ぶ。
単語共起表Ｅの一例を図５に示す。図中、「解析システム」、「計測システム」、「分析システム」が基軸単語Ｓであり、「精度」、「応答時間」、「手順」、「重量」、「価格」、「サイズ」、「処理」等が各基軸単語Ｓの共起語Ｖである。また、図中の数字が共起回数である。基軸単語Ｓの行のデータセットが基軸単語共起語ベクトルＮに相当するため、「解析システム」の共起語ベクトルＮ１は（２、２、１、３、１、２、２、・・・）、「計測システム」の共起語ベクトルＮ２は（１、１、０、２、１、０、２、・・・）、「分析システム」の共起語ベクトルＮ３は（０、１、１、０、１、１、０、・・・）となる。
なお、本第１の実施例では、この３つの基軸単語は文書中で同一とみなすことができ、同義語候補として抽出（推定）すべきものであるとする。
図６は、基軸単語ペア数と、上述の補正コサイン類似度により、類似度ランキングを導出した結果の一部を示している。ここで、上記基軸単語ペアの単語数とは、単語ペア毎の単語の出現数の和でも良いし、出現数が小さい方の単語出現数もしくは出現数が大きい方の単語出現数でも良い。なお、ｋ＝０は攪乱値ｋを導入しない場合であり、通常のコサイン類似度で分析した結果となる。ここで攪乱値ｋは以下の数式３で算出する。

ここで、αは過去の実績に基づく経験的な値で、ここでは０．１を適用する。
図６中、「計測システム」「分析システム」のペア数が５であるため、「解析システム」「計測システム」のペアではｋ＝３を、「解析システム」「分析システム」のペアではｋ＝２を、「計測システム」「分析システム」のペアではｋ＝１を利用する。このように基軸単語ペア数でｋ値を設定することで、同義語候補すなわち同義語の可能性の高い基軸単語ペアの類似度ランキングを上昇させることが可能となる。
また、攪乱値ｋを以下の数式４に基づき算出しても良い。

ここで、βは過去の実績に基づく経験的な値で、ここでは０．００００１である。
なお、具体的な同義語候補の出力としては、類似度ランキングが１位となる複数の基軸単語ペアを分析実施者に提示することが有効である。

Next, the synonym extraction system 200 according to the first embodiment of the present invention will be specifically described using a specific first example.
The synonym extraction system 200 receives from the document input unit 10 an input of a document that constitutes a document group for which an analyst wants to estimate a synonym candidate that is established only in a document group related to a specific case.
Then, the synonym extraction system 200 applies morphological analysis and syntactic analysis to the document, decomposes it into its constituent words, and analyzes the source sentence and part of speech of each word, thereby extracting nouns, verbs, adjectives, and adjectival nouns (keiyō-dōshi) as words. Verbs that follow the sa-row irregular conjugation (suru-verbs) are extracted with the conjugated part removed, that is, in the form of so-called sa-hen nouns.
Furthermore, the synonym extraction system 200 takes the nouns among the words in the document as words S. For each word Si (i = 1, 2, ..., n), it extracts the nouns, verbs, and adjectives that co-occur with Si in the same sentence as co-occurrence words Vj (j = 1, 2, ..., m), tallies the number of times each co-occurrence word Vj co-occurs with Si as the co-occurrence count Nij, and compiles the counts for all words Si and co-occurrence words Vj into a tabular word co-occurrence table E. The set of co-occurrence counts Nij of the co-occurrence words Vj for a word Si in the table E is called the word co-occurrence word vector Ni.
An example of the word co-occurrence table E is shown in FIG. 5. In the figure, “analysis system” (解析システム), “measurement system” (計測システム), and “analysis system” (分析システム) are the base words S, and “accuracy”, “response time”, “procedure”, “weight”, “price”, “size”, “processing”, and so on are the co-occurrence words V of the base words. The numbers in the figure are co-occurrence counts. Since the data in each base word's row corresponds to its base-word co-occurrence word vector N, the co-occurrence word vector N1 of 解析システム is (2, 2, 1, 3, 1, 2, 2, ...), the vector N2 of 計測システム is (1, 1, 0, 2, 1, 0, 2, ...), and the vector N3 of 分析システム is (0, 1, 1, 0, 1, 1, 0, ...).
In this first example, it is assumed that these three base words can be regarded as identical within the document and should be extracted (estimated) as synonym candidates.
FIG. 6 shows part of the similarity ranking derived from the base word pair count and the corrected cosine similarity described above. Here, the base word pair count may be the sum of the occurrence counts of the two words in the pair, the occurrence count of the less frequent word, or the occurrence count of the more frequent word. Note that k = 0 is the case where no disturbance value is introduced, and gives the result of the ordinary cosine similarity. Here, the disturbance value k is calculated by Equation 3 below.

Here, α is an empirical value based on past results, and 0.1 is applied here.
In FIG. 6, the pair count of “measurement system” (計測システム) and “analysis system” (分析システム) is 5, so k = 3 is used for the pair 解析システム and 計測システム, k = 2 for the pair 解析システム and 分析システム, and k = 1 for the pair 計測システム and 分析システム. Setting k from the base word pair count in this way makes it possible to raise the similarity ranking of synonym candidates, that is, base word pairs that are likely to be synonyms.
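To make the effect of these k values concrete, the sketch below applies them to the three example vectors. It rests on two assumptions: the vectors are truncated to the seven entries listed for FIG. 5, and the corrected cosine similarity adds k² to each squared norm (one possible reading of the correction, since the equation itself is not reproduced here). The results therefore do not reproduce FIG. 6; the romanized keys are hypothetical labels for 解析システム, 計測システム, and 分析システム.

```python
import math


def corrected_cosine(a, b, k=0.0):
    """Assumed form: (a . b) / (sqrt(|a|^2 + k^2) * sqrt(|b|^2 + k^2))."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a) + k * k)
    norm_b = math.sqrt(sum(y * y for y in b) + k * k)
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# First seven entries of N1, N2, N3 as listed for FIG. 5 (truncated).
vectors = {
    "kaiseki": [2, 2, 1, 3, 1, 2, 2],  # 解析システム
    "keisoku": [1, 1, 0, 2, 1, 0, 2],  # 計測システム
    "bunseki": [0, 1, 1, 0, 1, 1, 0],  # 分析システム
}

# Disturbance values per pair as stated for FIG. 6.
k_values = {
    ("kaiseki", "keisoku"): 3,
    ("kaiseki", "bunseki"): 2,
    ("keisoku", "bunseki"): 1,
}

scores = {}
for pair, k in k_values.items():
    a, b = vectors[pair[0]], vectors[pair[1]]
    # Store (ordinary cosine with k = 0, corrected cosine with the pair's k).
    scores[pair] = (corrected_cosine(a, b), corrected_cosine(a, b, k))

for pair, (plain, corrected) in scores.items():
    print(pair, round(plain, 3), round(corrected, 3))
```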
Further, the disturbance value k may be calculated based on Equation 4 below.

Here, β is an empirical value based on past results, and is 0.00001 here.
As a concrete output of synonym candidates, it is effective to present to the analyst the multiple base word pairs that rank first in similarity.

Next, the synonym extraction system 210 according to the second embodiment of the present invention will be described concretely using a second specific example.
The second example describes a case in which concept vectors are created by grouping co-occurrence words by concept, and the disturbance value k is introduced into the concept vectors thus created. This method is particularly effective when there are few co-occurrence words and the resulting co-occurrence word vectors are sparse. The flow is described in detail below.
First, the synonym extraction system 210 creates the word co-occurrence table E by the same process as in the first example.
Next, the synonym extraction system 210 queries the concept database 110 for the general concept information Cg of each co-occurrence word Vj in the word co-occurrence table E and, from the classification system of the thesaurus general concept information Cg stored in the concept database 110, extracts the large-category co-occurrence word concept C1vj, the middle-category co-occurrence word concept C2vj, and the small-category co-occurrence word concept C3vj to which each co-occurrence word Vj belongs. The synonym extraction system 210 then converts each co-occurrence word Vj in the word co-occurrence table E into its co-occurrence word concept C1vj, groups the co-occurrence words Vi that share the same concept, and creates the large-category word concept table SC1, in which the sums of the co-occurrence counts Nij are registered in the corresponding cells. In the same way, the synonym extraction system 210 converts each co-occurrence word Vj into its co-occurrence word concept C2vj, groups the co-occurrence words Vi that share the same concept, and creates the middle-category word concept table SC2, in which the sums of the co-occurrence counts Nij are registered in the corresponding cells. Finally, the synonym extraction system 210 converts each co-occurrence word Vj into its co-occurrence word concept C3vj, groups the co-occurrence words Vi that share the same concept, and creates the small-category word concept table SC3, in which the sums of the co-occurrence counts Nij are registered in the corresponding cells.
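The aggregation step can be sketched in Python as follows. This is a minimal illustration for a single row of table E at a single category level. The function name `aggregate_by_concept`, the co-occurrence counts, and the concept labels are all hypothetical and are not taken from FIG. 7 to FIG. 9; a real system would obtain the mapping from the concept database 110.

```python
from collections import defaultdict

def aggregate_by_concept(cooccurrence_row, word_to_concept):
    """Sum the co-occurrence counts of words that map to the same concept."""
    concept_counts = defaultdict(int)
    for word, count in cooccurrence_row.items():
        # Sketch-only fallback: a word missing from the mapping is kept
        # under its own surface form.
        concept = word_to_concept.get(word, word)
        concept_counts[concept] += count
    return dict(concept_counts)

# Hypothetical row of table E for one basic word, and a hypothetical
# middle-category mapping standing in for a concept database lookup.
row = {"management": 3, "construction": 2, "period": 1, "short-term": 2, "high-speed": 1}
middle_category = {
    "management": "operation",
    "construction": "operation",
    "period": "time",
    "short-term": "time",
    "high-speed": "speed",
}
concept_row = aggregate_by_concept(row, middle_category)
```

Running the same function with a large-category or small-category mapping would yield the rows of the other two word concept tables. Coarser categories merge more words, so the resulting vectors become denser.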
Through this process, the large-category word concept vector Nc1i, the middle-category word concept vector Nc2i, and the small-category word concept vector Nc3i can be created in the same manner as the co-occurrence word vectors.
FIG. 7 takes the basic word “accounting system” as an example and shows the words that co-occur with it in a specific document, namely “management”, “construction”, “average”, “period”, “method”, “high-speed”, and “short-term”, together with their grouping, based on the concept database, into large-category, middle-category, and small-category word concepts. FIG. 8 shows their co-occurrence counts.
FIG. 9 shows an example in which, for the basic words “business system” and “office system”, just as for the basic word “accounting system”, the co-occurrence words “management”, “construction”, “average”, “period”, “method”, “high-speed”, and “short-term” in a specific document are grouped, based on the concept database, into large-category, middle-category, and small-category word concepts, the co-occurrence counts of the middle-category word concepts are extracted, and a word co-occurrence table E is created. Since the data set in the row of a basic word S corresponds to its basic concept co-occurrence word vector, the co-occurrence word vector of “accounting system” is (5, 2, 3, 2, 2, ...), that of “business system” is (3, 1, 2, 2, 1, ...), and that of “office system” is (1, 1, 0, 1, 0, ...).
In this second example, the three basic words are assumed to be interchangeable within the document and should therefore be extracted (estimated) as synonym candidates.
FIG. 10 shows part of the similarity ranking derived from the number of basic word pairs and the above-described corrected cosine similarity. The case k = 0 is the case where no disturbance value k is introduced, which gives the result of analysis with the ordinary cosine similarity. Here, the disturbance value k is set to α × (sum of the numbers of appearances of the words in each word pair) ÷ (minimum value of that sum over the word pairs), with α set to 1.0.
In FIG. 10, since the number of pairs for “business system” and “office system” is 25, k = 6 is used for the pair of “accounting system” and “office system”, k = 10 for the pair of “accounting system” and “business system”, and k = 1 for the pair of “business system” and “office system”.
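The rule k = α × (pair count) ÷ (minimum pair count) can be checked with a short Python sketch. Only the count of 25 for “business system” and “office system” is stated in the text; the counts 150 and 250 below are assumptions back-computed so that α = 1.0 reproduces k = 6 and k = 10. The function name `disturbance_values` is likewise hypothetical.

```python
def disturbance_values(pair_counts, alpha=1.0):
    """Compute k = alpha * (pair count) / (smallest pair count) for each pair.

    Assumes at least one pair and strictly positive counts.
    """
    smallest = min(pair_counts.values())
    return {pair: alpha * count / smallest for pair, count in pair_counts.items()}

pair_counts = {
    ("business system", "office system"): 25,        # stated in the text
    ("accounting system", "office system"): 150,     # back-computed assumption
    ("accounting system", "business system"): 250,   # back-computed assumption
}
k = disturbance_values(pair_counts, alpha=1.0)
```

Because the rule divides by the smallest pair count, the least frequent pair always receives k = α and every other pair receives a proportionally larger value.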
As can be seen from FIG. 10, setting the disturbance value k according to the number of basic word pairs makes it possible to raise the similarity ranking of synonym candidates, that is, basic word pairs that are highly likely to be synonyms. As concrete output of synonym candidates, it is effective to present to the analyst the multiple basic word pairs that rank first in similarity.
As described above, the synonym extraction system of the present invention can extract, from a group of documents related to a specific project such as proposals and specifications for information system construction, synonyms that hold only within that document group, and present them. As a result, confusion caused by discrepancies in understanding between the customer and the information system builder, or among information system builders, can be prevented, and ultimately the rework in information system construction caused by such discrepancies can be reduced. The specific reason is that a disturbance value k is defined for each word from the number of occurrences or appearance frequency of word pairs, the amount of documents, the number of word variations in the documents, and the concept aggregation hierarchy of co-occurrence words, and the similarity of synonyms in the text is ranked and displayed using the corrected cosine similarity based on that disturbance value k. This makes it possible to calculate the similarity between words more accurately from the limited amount of text available in a document group related to a specific project.
Although the present invention has been described with reference to the embodiments (examples), the present invention is not limited to the above-described embodiments (and examples). Various changes that can be understood by those skilled in the art can be made to the configuration and details of the present invention within the scope of the present invention.

According to the present invention, ambiguity can be removed from the various documents created during requirements definition and specification drafting in information system development, and the understanding, creation, and revision of those documents can be supported. The invention is therefore applicable to uses that improve the efficiency of information system development, such as reducing rework and improving customer satisfaction. In addition, since synonyms can be extracted with high accuracy, the invention can be used in a translation system to select among alternative translations.

DESCRIPTION OF SYMBOLS
10 Document input unit
20 Word analysis unit
30 Co-occurrence information extraction unit
31 Concept information extraction unit
40 Disturbance value calculation unit
50 Synonym candidate estimation unit
60 Synonym candidate output unit
100 Word database
110 Concept database
200 Synonym extraction system
210 Synonym extraction system

This application claims priority based on Japanese Patent Application No. 2012-141876, filed on June 25, 2012, the entire disclosure of which is incorporated herein.

Claims

A synonym extraction system that analyzes documents and extracts synonyms, comprising at least:
A document input unit that receives input of a document to be analyzed;
A word analysis unit that applies morphological analysis and syntactic analysis to each sentence and compound word in the document and extracts the part of speech and dependency relations of each word;
A co-occurrence information extraction unit that extracts a co-occurrence word vector of each word as context information between words;
A disturbance value calculation unit that calculates a disturbance value to be added to the co-occurrence word vector from at least one of the number of occurrences or the appearance frequency of a word pair, the document amount, and the number of variations of words in the document;
A synonym candidate estimation unit that estimates, based on a corrected cosine similarity using the disturbance value, word pair candidates having a synonym relationship as synonym candidates; and
A synonym candidate output unit that presents the estimated synonym candidates.

A synonym extraction system that analyzes documents and extracts synonyms, comprising at least:
A document input unit that receives input of a document to be analyzed;
A word analysis unit that applies morphological analysis and syntactic analysis to each sentence and compound word in the document and extracts the part of speech and dependency relations of each word;
A concept information extraction unit that extracts, as context information between words, a concept vector that aggregates the co-occurrence word vectors of each word;
A disturbance value calculation unit that calculates a disturbance value to be added to the concept vector from at least one of the number of occurrences or the appearance frequency of a word pair, the document amount, the number of variations of words in the document, and the concept aggregation hierarchy of co-occurrence words;
A synonym candidate estimation unit that estimates, based on a corrected cosine similarity using the disturbance value, word pair candidates having a synonym relationship as synonym candidates; and
A synonym candidate output unit that presents the estimated synonym candidates.

The synonym extraction system according to claim 1, further comprising a word database that collects and accumulates information on the part of speech and syntax of words and that, in response to an inquiry about a specific word from the word analysis unit, searches for and returns information related to the part of speech and syntax of that word.

The synonym extraction system according to claim 2, further comprising:
A word database that collects and accumulates information on the part of speech and syntax of words and that, in response to an inquiry about a specific word from the word analysis unit, searches for and returns information related to the part of speech and syntax of that word; and
A concept database that collects and accumulates general concept information organizing the general concepts of words and that, in response to an inquiry about a specific word from the concept information extraction unit, searches for and returns general concept information related to the meaning and usage of that word.

The synonym extraction system according to claim 1, wherein the document to be analyzed is a development document related to a specific project.

The synonym extraction system according to claim 1 or 2, wherein the disturbance value calculated by the disturbance value calculation unit increases as the number of combinations in the document of the word pair increases.

The synonym extraction system according to claim 1 or 2, wherein the disturbance value calculated by the disturbance value calculation unit increases as the amount of the document to be analyzed increases.

A synonym extraction method for analyzing documents and extracting synonyms, comprising at least:
A document reception step of receiving input of a document to be analyzed;
A word information extraction step of applying morphological analysis and syntactic analysis to each sentence and compound word in the document and extracting word information and dependency relations of each word;
A co-occurrence information extraction step of extracting a co-occurrence word vector of each word as context information between words;
A disturbance value calculation step of calculating a disturbance value to be added to the co-occurrence word vector from at least one of the number of occurrences or the appearance frequency of a word pair, the document amount, and the number of variations of words in the document;
A synonym candidate estimation step of estimating, based on a corrected cosine similarity using the disturbance value, word pair candidates having a synonym relationship as synonym candidates; and
A synonym candidate output step of presenting the estimated synonym candidates.

A synonym extraction method for analyzing documents and extracting synonyms, comprising at least:
A document reception step of receiving input of a document to be analyzed;
A word information extraction step of applying morphological analysis and syntactic analysis to each sentence and compound word in the document and extracting word information and dependency relations of each word;
A concept information extraction step of extracting, as context information between words, a concept vector in which the co-occurrence word vectors of each word are aggregated;
A disturbance value calculation step of calculating a disturbance value to be added to the concept vector from at least one of the number of occurrences or the appearance frequency of a word pair, the document amount, the number of variations of words in the document, and the concept aggregation hierarchy of co-occurrence words;
A synonym candidate estimation step of estimating, based on a corrected cosine similarity using the disturbance value, word pair candidates having a synonym relationship as synonym candidates; and
A synonym candidate output step of presenting the estimated synonym candidates.

The synonym extraction method as described, wherein the word information extraction step queries a word database that collects and accumulates information on the part of speech and syntax of words for a specific word and extracts information related to the part of speech and syntax of that word.

The synonym extraction method according to claim 9, wherein:
The word information extraction step queries a word database that collects and accumulates information on the part of speech and syntax of words for a specific word and extracts information related to the part of speech and syntax of that word; and
The concept information extraction step queries a concept database that collects and accumulates general concept information organizing the general concepts of words for a specific word and extracts general concept information related to the meaning and usage of that word.

The synonym extraction method according to claim 8 or 9, wherein the document to be analyzed is a development document related to a specific project.

The synonym extraction method according to claim 8 or 9, wherein the disturbance value calculated by the disturbance value calculation step increases as the number of combinations in the document of the word pair increases.

The synonym extraction method according to claim 8 or 9, wherein the disturbance value calculated by the disturbance value calculation step increases as the amount of document to be analyzed increases.

A computer-readable recording medium recording a synonym extraction program that causes a computer to analyze documents and extract synonyms, the program causing the computer to execute at least:
A document acceptance procedure for accepting input of a document to be analyzed;
A word information extraction procedure for applying morphological analysis and syntactic analysis to each sentence and compound word in the document and extracting word information and dependency relations of each word;
A co-occurrence information extraction procedure for extracting a co-occurrence word vector of each word as context information between words;
A disturbance value calculation procedure for calculating a disturbance value to be added to the co-occurrence word vector from at least one of the number of occurrences or the appearance frequency of a word pair, the document amount, and the number of variations of words in the document;
A synonym candidate estimation procedure for estimating, based on a corrected cosine similarity using the disturbance value, word pair candidates having a synonym relationship as synonym candidates; and
A synonym candidate output procedure for presenting the estimated synonym candidates.

A computer-readable recording medium recording a synonym extraction program that causes a computer to analyze documents and extract synonyms, the program causing the computer to execute at least:
A document acceptance procedure for accepting input of a document to be analyzed;
A word information extraction procedure for applying morphological analysis and syntactic analysis to each sentence and compound word in the document and extracting word information and dependency relations of each word;
A concept information extraction procedure for extracting, as context information between words, a concept vector obtained by aggregating the co-occurrence word vectors of each word;
A disturbance value calculation procedure for calculating a disturbance value to be added to the concept vector from at least one of the number of occurrences or the appearance frequency of a word pair, the document amount, the number of variations of words in the document, and the concept aggregation hierarchy of co-occurrence words;
A synonym candidate estimation procedure for estimating, based on a corrected cosine similarity using the disturbance value, word pair candidates having a synonym relationship as synonym candidates; and
A synonym candidate output procedure for presenting the estimated synonym candidates.

The computer-readable recording medium recording the synonym extraction program according to claim 15 or 16, wherein the word information extraction procedure queries a word database that collects and accumulates information on the part of speech and syntax of words for a specific word and extracts information related to the part of speech and syntax of that word.

The computer-readable recording medium recording the synonym extraction program according to claim 16, wherein:
The word information extraction procedure queries a word database that collects and accumulates information on the part of speech and syntax of words for a specific word and extracts information related to the part of speech and syntax of that word; and
The concept information extraction procedure queries a concept database that collects and accumulates general concept information organizing the general concepts of words for a specific word and extracts general concept information related to the meaning and usage of that word.

The computer-readable recording medium recording the synonym extraction program according to claim 15 or 16, wherein the document to be analyzed is a development document related to a specific project.

The computer-readable recording medium recording the synonym extraction program according to claim 15 or 16, wherein the disturbance value calculated by the disturbance value calculation procedure increases as the number of combinations of the word pair in the document increases.

The computer-readable recording medium recording the synonym extraction program according to claim 15 or 16, wherein the disturbance value calculated by the disturbance value calculation procedure increases as the amount of the document to be analyzed increases.