JP6737151B2

JP6737151B2 - Synonym expression extraction device, synonym expression extraction method, and synonym expression extraction program

Info

Publication number: JP6737151B2
Application number: JP2016230635A
Authority: JP
Inventors: 育昌鄭; 友樹長瀬
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2016-11-28
Filing date: 2016-11-28
Publication date: 2020-08-05
Anticipated expiration: 2036-11-28
Also published as: JP2018088101A

Description

The present invention relates to a synonym expression extraction device, a synonym expression extraction method, and a synonym expression extraction program.

One known technique for extracting synonyms from document data calculates, for synonym candidates extracted from the document data, a notation similarity between the candidates based on context co-occurrence and notation edit distance, and determines whether the candidates are synonymous based on that notation similarity (see, for example, Patent Document 1). This type of synonym extraction method can extract synonyms of words and compound nouns used in a specific field.

International Publication No. 2014/002776

In the above synonym extraction method, whether a pair of compound nouns is synonymous is determined based on the similarity between the notation of one compound noun and the notation of the other. Consequently, when the notations of two compound nouns that are actually synonymous share few overlapping parts, the pair may be erroneously determined not to be synonymous expressions.

In one aspect, an object of the present invention is to accurately extract synonymous expressions of compound nouns from document data.

A synonym expression extraction device according to one aspect includes a word pair setting unit, a semantic similarity learning unit, a word synonym determination unit, and a compound noun synonym determination unit. The word pair setting unit divides a pair of compound nouns extracted from document data into a plurality of word pairs and, by referring to a synonym word dictionary in which synonymous word pairs are registered, classifies the word pairs into synonymous word pairs and undetermined word pairs, that is, word pairs for which it has not been determined whether the words are synonymous. The semantic similarity learning unit performs, multiple times, a process of learning the semantic similarity between words for the undetermined word pair and for each of a plurality of synonymous word pairs in the document data, including the synonymous word pairs in the pair of compound nouns. The word synonym determination unit determines whether the words of the undetermined word pair are synonymous based on the learning result of the semantic similarity of the undetermined word pair and the learning results of the semantic similarity of each of the plurality of synonymous word pairs. The compound noun synonym determination unit determines that the pair of compound nouns is a synonymous expression when all of the word pairs in the pair of compound nouns are synonymous word pairs.
In this synonym expression extraction device, the semantic similarity learning unit adds cases to be used for the learning process each time it performs the semantic similarity learning process on the word pairs to be processed. The word synonym determination unit calculates a correlation coefficient between the relationship of the number of learning rounds to the semantic similarity in the learning result of the undetermined word pair and the relationship of the number of learning rounds to the semantic similarity in the learning results of the synonymous word pairs. The word synonym determination unit determines whether the words of the undetermined word pair are synonymous based on the calculated correlation coefficient.

According to the above aspect, synonymous expressions of compound nouns can be accurately extracted from document data.

A diagram showing the functional configuration of the synonym expression extraction device according to the first embodiment.
A diagram showing an example of the synonym word dictionary.
A diagram illustrating an example of a pair of compound nouns and a method of determining whether they are synonymous.
A flowchart illustrating the processing performed by the synonym expression extraction device according to the first embodiment.
A flowchart illustrating the synonymous compound noun identification processing.
A flowchart illustrating the semantic similarity learning processing.
A flowchart (part 1) illustrating the synonym determination processing for word pairs.
A flowchart (part 2) illustrating the synonym determination processing for word pairs.
A flowchart illustrating the synonym determination processing for pairs of compound nouns.
A diagram showing an example of a character string and the result of its morphological analysis.
A diagram showing an example of the extraction result of compound noun pairs and a list of word pairs.
A diagram showing an example of the similarity transition table.
A diagram illustrating a method of determining whether an undetermined word pair is a synonymous expression.
A graph illustrating a method of determining the number of learning rounds.
A diagram showing the functional configuration of the synonym expression extraction device according to the second embodiment.
A flowchart illustrating the synonymous compound noun identification processing according to the second embodiment.
A flowchart illustrating the determination threshold setting processing.
A diagram illustrating a specific example of the method of setting the determination threshold.
A diagram showing the system configuration of the synonym dictionary creation system according to the third embodiment.
A diagram showing the system configuration of the document rewriting system according to the fourth embodiment.
A diagram showing the functional configuration of the document data rewriting device.
A flowchart illustrating the processing performed by the document data rewriting device.
A diagram showing the hardware configuration of a computer.

[First Embodiment]
FIG. 1 is a diagram showing the functional configuration of the synonym expression extraction device according to the first embodiment.

As shown in FIG. 1, the synonym expression extraction device 1 according to the present embodiment includes a character string extraction unit 110, a morphological analysis unit 120, a compound noun extraction unit 130, and a synonymous compound noun identification unit 140. The synonym expression extraction device 1 also includes a storage unit (not shown) that stores a document set 191, an analysis dictionary 192, an analysis result corpus 193, a synonym word dictionary 194, a similarity transition table 195, and a synonym expression list 196.

The character string extraction unit 110 extracts character strings from the document set 191. The document set 191 includes a plurality of pieces of document data used for extracting synonymous expressions of compound nouns. Each piece of document data in the document set 191 includes, for example, terms used in a specific technical field, a specific industry, or the like. The document set 191 is stored in the storage unit of the synonym expression extraction device 1 via, for example, an input reception unit (not shown).

The morphological analysis unit 120 refers to the analysis dictionary 192 and performs morphological analysis on the character strings extracted by the character string extraction unit 110. The analysis dictionary 192 includes word information used for morphological analysis. The morphological analysis unit 120 registers the result of the morphological analysis in the analysis result corpus 193 and passes it to the compound noun extraction unit 130.

The compound noun extraction unit 130 extracts compound nouns from the character strings based on the result of the morphological analysis. For example, the compound noun extraction unit 130 extracts word strings that satisfy the conditions for a compound noun based on the order of the parts of speech in a character string. The compound noun extraction unit 130 also extracts, from the extracted compound nouns, pairs of compound nouns that may be synonymous expressions.
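The part-of-speech-based extraction described above can be sketched as follows. This is a minimal illustration, not the device's actual rule: it assumes that a compound noun is any run of two or more consecutive nouns, and the token format, the tag name `"noun"`, and the function name `extract_compound_nouns` are all hypothetical.

```python
from typing import List, Tuple

# Assumed token format: (surface form, part-of-speech tag).
Token = Tuple[str, str]


def extract_compound_nouns(tokens: List[Token], min_len: int = 2) -> List[Tuple[str, ...]]:
    """Return every maximal run of at least `min_len` consecutive nouns."""
    compounds: List[Tuple[str, ...]] = []
    run: List[str] = []
    # A sentinel non-noun token flushes a run that reaches the end of the input.
    for surface, pos in tokens + [("", "EOS")]:
        if pos == "noun":
            run.append(surface)
        else:
            if len(run) >= min_len:
                compounds.append(tuple(run))
            run = []
    return compounds
```

A real morphological analyzer produces finer-grained tags, so the compound noun condition would normally be stated over those tags rather than a single `"noun"` label.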

The synonymous compound noun identification unit 140 identifies, based on the analysis result corpus 193 and the synonym word dictionary 194, the pairs of compound nouns that are synonymous expressions from among the pairs of compound nouns extracted by the compound noun extraction unit 130. A plurality of synonymous word pairs are registered in the synonym word dictionary 194. For a pair of compound nouns that may be synonymous expressions, the synonymous compound noun identification unit 140 divides each compound noun into words to generate word pairs, and refers to the synonym word dictionary 194 to determine whether the words of each word pair are synonymous. When the word pairs of the compound noun pair include a word pair for which it is undetermined whether the words are synonymous, the synonymous compound noun identification unit 140 learns the semantic similarity for all word pairs identified from the result of the morphological analysis. In the following description, a word pair whose words are synonymous is called a synonymous word pair, and a word pair for which it is undetermined whether the words are synonymous is called an undetermined word pair.

When learning the semantic similarity, the synonymous compound noun identification unit 140 performs the semantic similarity learning process (calculation process), which is based on a plurality of cases, multiple times. Each time one learning process ends, the synonymous compound noun identification unit 140 adds a plurality of cases and performs the next learning process. Here, one case is, for example, the result of morphological analysis of one character string. Each time one semantic similarity learning process ends, the synonymous compound noun identification unit 140 stores the learning result in the similarity transition table 195. The synonymous compound noun identification unit 140 then determines whether the words of the undetermined word pair are synonymous based on the relationship between the number of cases (number of processing rounds) and the semantic similarity for the undetermined word pair, and the relationship between the number of cases and the semantic similarity for the synonymous word pairs. When the words of the undetermined word pair are synonymous, the synonymous compound noun identification unit 140 changes the undetermined word pair to a synonymous word pair.
After determining whether the undetermined word pair is a synonymous word pair, the synonymous compound noun identification unit 140 determines whether all of the word pairs in the pair of compound nouns that may be synonymous expressions are synonymous word pairs. When all of the word pairs in the pair of compound nouns are synonymous word pairs, the synonymous compound noun identification unit 140 determines that the compound nouns of that pair are synonymous expressions. The synonymous compound noun identification unit 140 registers each pair of compound nouns determined to be synonymous expressions in the synonym expression list 196.

The synonymous compound noun identification unit 140 having the above functions includes a word pair setting unit 141, a similarity transition table creation unit 142, a semantic similarity learning unit 143, a word synonym determination unit 144, and a compound noun synonym determination unit 145.

The word pair setting unit 141 divides each compound noun in a pair of compound nouns that may be synonymous expressions into words to generate a plurality of word pairs, and refers to the synonym word dictionary 194 to classify the word pairs into undetermined word pairs and synonymous word pairs. Furthermore, the word pair setting unit 141 refers to the analysis result corpus 193 and the synonym word dictionary 194 and extracts synonymous word pairs from among the words in the document set 191 that differ from the words included in the pair of compound nouns that may be synonymous expressions. The word pair setting unit 141 sets the undetermined word pairs and the synonymous word pairs as the word pairs used for determining whether the compound nouns in the pair are synonymous expressions.
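The classification performed by the word pair setting unit 141 can be sketched as follows. The sketch assumes that both compound nouns have already been split into the same number of words and that words are paired by position, as in the example of FIG. 3; the function name `classify_word_pairs` and the representation of the dictionary as a set of `frozenset` pairs are hypothetical.

```python
from typing import FrozenSet, List, Sequence, Set, Tuple

WordPair = Tuple[str, str]


def classify_word_pairs(
    compound_a: Sequence[str],
    compound_b: Sequence[str],
    synonym_dict: Set[FrozenSet[str]],
) -> Tuple[List[WordPair], List[WordPair]]:
    """Split a compound noun pair into positional word pairs and classify them.

    Returns (synonymous word pairs, undetermined word pairs).
    """
    synonymous: List[WordPair] = []
    undetermined: List[WordPair] = []
    for pair in zip(compound_a, compound_b):
        # frozenset makes the dictionary lookup independent of word order.
        if frozenset(pair) in synonym_dict:
            synonymous.append(pair)
        else:
            undetermined.append(pair)
    return synonymous, undetermined
```

Storing each dictionary entry as an unordered pair means that ("fare", "transportation expense") and its reverse match the same entry.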

The similarity transition table creation unit 142 generates the similarity transition table 195, which stores the learning results of the semantic similarity for the undetermined word pairs and all of the synonymous word pairs. The similarity transition table creation unit 142 causes the semantic similarity learning unit 143 to learn (calculate) the semantic similarity of each word pair and stores the learning result in the similarity transition table 195. When the total number of word pairs is N and the semantic similarity learning process is executed M times, the similarity transition table creation unit 142 generates, for example, a similarity transition table 195 of N × M cells.
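The N × M layout of the similarity transition table can be pictured with the following sketch. It assumes the table is held as a dictionary from word pair to a list of M similarity values; `build_transition_table` and the `learn_similarity` callback are hypothetical names standing in for the interaction between units 142 and 143.

```python
from typing import Callable, Dict, List, Sequence, Tuple

WordPair = Tuple[str, str]


def build_transition_table(
    pairs: Sequence[WordPair],
    num_rounds: int,
    learn_similarity: Callable[[WordPair, int], float],
) -> Dict[WordPair, List[float]]:
    """Fill one row per word pair (N rows) with one similarity per round (M columns)."""
    table: Dict[WordPair, List[float]] = {pair: [] for pair in pairs}
    for m in range(1, num_rounds + 1):
        for pair in pairs:
            # The m-th cell of a row holds the similarity learned in round m.
            table[pair].append(learn_similarity(pair, m))
    return table
```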

The semantic similarity learning unit 143 repeats, M times, a process of generating semantic vectors for the words of the cases based on the morphological analysis results (cases) stored in the analysis result corpus 193 and calculating (learning) the semantic similarity of each word pair based on the generated semantic vectors. Each time it finishes one round of calculating the semantic similarity of the word pairs, the semantic similarity learning unit 143 adds H cases to those used for the calculation. For example, the semantic similarity learning unit 143 sets the number of cases Ht in the m-th round (m = 1, 2, ..., M) to Ht = m × H.
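The incremental schedule Ht = m × H can be sketched as follows. The text does not specify how the semantic vectors are generated, so this sketch substitutes simple within-sentence co-occurrence counts and cosine similarity as a stand-in; the function names are hypothetical, and each case is assumed to be a list of words from one analyzed character string.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Sequence, Tuple


def cooccurrence_vectors(cases: Sequence[Sequence[str]]) -> Dict[str, Counter]:
    """Stand-in semantic vectors: counts of words co-occurring in the same case."""
    vectors: Dict[str, Counter] = defaultdict(Counter)
    for words in cases:
        for w in words:
            for c in words:
                if c != w:
                    vectors[w][c] += 1
    return vectors


def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity of two sparse count vectors (0.0 if either is empty)."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def similarity_transition(
    cases: Sequence[Sequence[str]], pair: Tuple[str, str], H: int, M: int
) -> List[float]:
    """Similarity of `pair` after each of M rounds; round m uses the first m * H cases."""
    history: List[float] = []
    for m in range(1, M + 1):
        vectors = cooccurrence_vectors(cases[: m * H])
        history.append(cosine(vectors[pair[0]], vectors[pair[1]]))
    return history
```

Rebuilding the vectors from scratch in every round keeps the sketch short; an implementation that updates the vectors with only the H newly added cases would give the same counts with less work.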

The word synonym determination unit 144 determines whether the words of an undetermined word pair are synonymous expressions based on the transition of the semantic similarity for the synonymous word pairs and the transition of the semantic similarity for the undetermined word pair in the learning results (similarity transition table 195). The word synonym determination unit 144 determines that the words of the undetermined word pair are synonymous expressions when the correlation coefficient between the transition of the semantic similarity for the synonymous word pairs and the transition of the semantic similarity for the undetermined word pair is equal to or greater than a threshold.
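The threshold test on the correlation coefficient can be sketched as follows. How the transitions of several synonymous word pairs are combined is not stated at this point in the text, so the sketch assumes they are averaged per round into one reference series; that averaging, the use of the Pearson coefficient, the default threshold of 0.8, and the function names are all assumptions made for illustration.

```python
import math
from typing import List, Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient (0.0 if either series is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0


def is_synonymous(
    undetermined_series: Sequence[float],
    synonym_series_list: Sequence[Sequence[float]],
    threshold: float = 0.8,
) -> bool:
    """Judge an undetermined pair by how its similarity transition tracks known synonyms."""
    # Assumption: average the known synonymous pairs round by round.
    reference: List[float] = [sum(col) / len(col) for col in zip(*synonym_series_list)]
    return pearson(undetermined_series, reference) >= threshold
```

With this rule, an undetermined pair whose similarity rises as cases are added, in step with the known synonymous pairs, passes the test, while a pair whose similarity falls does not.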

The compound noun synonym determination unit 145 determines, based on the result from the word synonym determination unit 144, whether the compound nouns in a pair of compound nouns that may be synonymous expressions are in fact synonymous expressions. The compound noun synonym determination unit 145 determines that the compound nouns of the pair are synonymous expressions when all of the word pairs in the pair of compound nouns are synonymous word pairs. The compound noun synonym determination unit 145 registers each pair of compound nouns determined to be synonymous expressions in the synonym expression list 196.
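The all-pairs condition used by the compound noun synonym determination unit 145 can be sketched as follows. The function names and the representation of the synonym expression list as a plain Python list are hypothetical; the set of synonymous word pairs is assumed to already include any undetermined pairs that were judged synonymous.

```python
from typing import FrozenSet, List, Sequence, Set, Tuple

WordPair = Tuple[str, str]


def compound_pair_is_synonymous(
    word_pairs: Sequence[WordPair], synonymous_pairs: Set[FrozenSet[str]]
) -> bool:
    """True only if there is at least one word pair and every pair is synonymous."""
    return bool(word_pairs) and all(frozenset(p) in synonymous_pairs for p in word_pairs)


def register_if_synonymous(
    compound_a: str,
    compound_b: str,
    word_pairs: Sequence[WordPair],
    synonymous_pairs: Set[FrozenSet[str]],
    synonym_expression_list: List[Tuple[str, str]],
) -> bool:
    """Append the compound noun pair to the list when it is judged synonymous."""
    if compound_pair_is_synonymous(word_pairs, synonymous_pairs):
        synonym_expression_list.append((compound_a, compound_b))
        return True
    return False
```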

FIG. 2 is a diagram showing an example of the synonym word dictionary.
As shown in FIG. 2, a plurality of pairs of words that are synonymous expressions (synonymous word pairs) are registered in the synonym word dictionary 194. The types of words registered in the synonym word dictionary 194 and the number K of synonymous word pairs can be set as appropriate. For example, the synonymous word pairs registered in the synonym word dictionary 194 may be representative synonymous word pairs that appear frequently in the field of the documents included in the document set 191, or may be synonymous word pairs extracted at random regardless of the field of those documents.

As described above, the synonym word dictionary 194 is used when determining whether the compound nouns in a pair of compound nouns that may be synonymous expressions are synonymous.

FIG. 3 is a diagram illustrating an example of a pair of compound nouns and a method of determining whether they are synonymous.
FIG. 3 shows, as an example of a pair of compound nouns, a first compound noun (word string) 2, "fare calculation module" (運賃計算モジュール), and a second compound noun (word string) 3, "transportation expense settlement function" (交通費精算機能). The first compound noun 2 combines a first word 201, "fare" (運賃), a second word 202, "calculation" (計算), and a third word 203, "module" (モジュール). The second compound noun 3 combines a first word 301, "transportation expense" (交通費), a second word 302, "settlement" (精算), and a third word 303, "function" (機能).

For example, the first compound noun 2 and the second compound noun 3 are compound nouns used in the field of application software that calculates fares (transportation expenses). However, the appearance frequency of the first compound noun 2 in a document set 191 for that field is often lower than the appearance frequency of each of its words 201 to 203. Similarly, the appearance frequency of the second compound noun 3 in the document set 191 is often lower than the appearance frequency of each of its words 301 to 303. It is therefore difficult to accurately determine whether the two compound nouns are synonymous based on the appearance frequency, context similarity, or the like of the first compound noun 2 and the second compound noun 3 in the document set 191.

In contrast, in the present embodiment, as shown in FIG. 3, the two compound nouns 2 and 3 are each divided into words to generate a plurality of word pairs WP1, WP2, and WP3, and whether each word pair is synonymous is determined by referring to the synonym word dictionary 194. In the example shown in FIG. 3, it is first determined whether the word pair WP1, consisting of the first word 201 ("fare") of the first compound noun 2 and the first word 301 ("transportation expense") of the second compound noun 3, is synonymous. The word pair of "fare" and "transportation expense" is registered in the synonym word dictionary 194 shown in FIG. 2. The synonym expression extraction device 1 therefore determines that the word pair WP1 is synonymous. Similarly, the synonym expression extraction device 1 determines that the word pair WP2, consisting of the second word 202 ("calculation") of the first compound noun 2 and the second word 302 ("settlement") of the second compound noun 3, is a synonymous word pair.

Suppose further that the word pair WP3, consisting of the third word 203 ("module") of the first compound noun 2 and the third word 303 ("function") of the second compound noun 3, is registered in the synonym word dictionary 194 of FIG. 2. In this case, the synonym expression extraction device 1 determines that the word pair WP3 is synonymous. When it determines in this way that all of the word pairs WP1 to WP3 in the pair of compound nouns are synonymous word pairs, the synonym expression extraction device 1 determines that the compound nouns of the pair are synonymous.

On the other hand, when the word pair WP3 of "module" and "function" is not registered in the synonym word dictionary 194 of FIG. 2, the synonym expression extraction device 1 treats the word pair WP3 in the pair of compound nouns shown in FIG. 3 as an undetermined word pair. In this case, to determine whether the first compound noun 2 and the second compound noun 3 are synonymous expressions, the synonym expression extraction device 1 performs a process of determining whether the word pair WP3 is a synonymous word pair. The synonym expression extraction device 1 determines whether the undetermined word pair is a synonymous word pair based on the transition of the semantic similarity learning results for each of a plurality of synonymous word pairs collected from the document set 191 and the transition of the semantic similarity learning results for the undetermined word pair. When this process determines that the word pair WP3 of "module" and "function" is a synonymous word pair, the synonym expression extraction device 1 determines that the first compound noun 2 ("fare calculation module") and the second compound noun 3 ("transportation expense settlement function") are synonymous expressions.

The processing performed by the synonym expression extraction device 1 according to the present embodiment is described below with reference to FIGS. 4 to 8.

FIG. 4 is a flowchart illustrating the processing performed by the synonym expression extraction device according to the first embodiment.

As shown in FIG. 4, the synonym expression extraction device 1 of the present embodiment first extracts character strings from the document set 191 (step S1). The processing of step S1 is performed by the character string extraction unit 110. The character string extraction unit 110 extracts character strings from each of the pieces of document data included in the document set 191 according to a known extraction method.

Next, the synonym expression extraction device 1 performs morphological analysis on the extracted character strings (step S2). The processing of step S2 is performed by the morphological analysis unit 120. The morphological analysis unit 120 refers to the analysis dictionary 192 and performs morphological analysis on each of the extracted character strings according to a known analysis method. The morphological analysis unit 120 stores the result of the morphological analysis in the analysis result corpus 193 and passes it to the compound noun extraction unit 130. The results are stored in the analysis result corpus 193 in units of the sentences in the document set 191.

Next, the synonym expression extraction device 1 extracts compound nouns from the character strings based on the result of the morphological analysis (step S3). The processing of step S3 is performed by the compound noun extraction unit 130. The compound noun extraction unit 130 extracts, from each character string, word strings that satisfy the conditions for a compound noun based on the result of the morphological analysis of that character string.

Next, the synonym expression extraction device 1 determines whether a plurality of types of compound nouns have been extracted (step S4). The determination of step S4 is performed by, for example, the compound noun extraction unit 130. When no compound noun has been extracted, or when only one type of compound noun has been extracted (step S4; NO), there is no pair of compound nouns to be examined for synonymy, so the compound noun extraction unit 130 (synonym expression extraction device 1) ends the processing.

On the other hand, when a plurality of types of compound nouns have been extracted (step S4; YES), the synonym expression extraction device 1 next identifies pairs of compound nouns that may be synonymous expressions (step S5). The processing of step S5 is performed by, for example, the compound noun extraction unit 130. In step S5, the compound noun extraction unit 130 identifies pairs of compound nouns that may be synonymous expressions according to a known identification method. For example, the compound noun extraction unit 130 calculates a context similarity between the compound nouns for each possible combination of two compound nouns and, based on the context similarity, identifies the pairs of compound nouns that may be synonymous expressions.
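The candidate identification of step S5 can be sketched as follows. The text refers only to "a known identification method", so this sketch substitutes the Jaccard overlap of the sets of words that surround each compound noun as a simple stand-in for context similarity; the function name `candidate_pairs`, the `contexts` mapping, and the threshold value are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple


def candidate_pairs(
    contexts: Dict[str, Set[str]], threshold: float = 0.3
) -> List[Tuple[str, str]]:
    """Return compound noun pairs whose context word sets overlap enough.

    `contexts` maps each compound noun to the set of words seen around it.
    """
    pairs: List[Tuple[str, str]] = []
    # Examine every unordered combination of two distinct compound nouns.
    for a, b in combinations(sorted(contexts), 2):
        union = contexts[a] | contexts[b]
        score = len(contexts[a] & contexts[b]) / len(union) if union else 0.0
        if score >= threshold:
            pairs.append((a, b))
    return pairs
```

Because this step only nominates candidates, a permissive threshold is reasonable; the word-pair-level determination in step S7 makes the final decision.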

Next, the synonym expression extraction device 1 determines whether there is any pair of compound nouns that may be synonymous expressions (step S6). The determination in step S6 is performed by, for example, the compound noun extraction unit 130. When there is no such pair (step S6; NO), there is no pair of compound nouns to be judged for synonymy, so the compound noun extraction unit 130 (the synonym expression extraction device 1) ends the series of processes for the one document set 191.

On the other hand, when there is a pair of compound nouns that may be synonymous expressions (step S6; YES), the synonym expression extraction device 1 next performs the synonymous compound noun specifying process (step S7). The process of step S7 is performed by the synonymous compound noun specifying unit 140. The synonymous compound noun specifying unit 140 performs the processing described above and specifies pairs of compound nouns that are synonymous expressions. When it has specified such a pair, the synonymous compound noun specifying unit 140 registers the pair in the synonym expression list 196. When the synonymous compound noun specifying process (step S7) for the pairs of compound nouns that may be synonymous expressions is finished, the synonym expression extraction device 1 ends the series of processes for the one document set 191.

As described above, the process of step S7 (the synonymous compound noun specifying process) is performed by the synonymous compound noun specifying unit 140. As the synonymous compound noun specifying process, the synonymous compound noun specifying unit 140 performs, for example, the process shown in FIG. 5.

FIG. 5 is a flowchart illustrating the contents of the synonymous compound noun specifying process.
In the synonymous compound noun specifying process, the synonymous compound noun specifying unit 140 first identifies the synonymous word pairs and the undetermined word pairs in each pair of compound nouns that may be synonymous expressions (step S71). The process of step S71 is performed by the word pair setting unit 141 of the synonymous compound noun specifying unit 140. As shown in FIG. 3, the word pair setting unit 141 divides a pair of compound nouns that may be synonymous expressions into a plurality of word pairs and identifies, for each word pair, whether it is a synonymous word pair or an undetermined word pair. The word pair setting unit 141 makes this identification by referring to the synonym word dictionary 194. In step S71, the word pair setting unit 141 also generates, for example, a word pair list indicating whether each of the plurality of word pairs is a synonymous word pair or an undetermined word pair.
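Step S71 can be illustrated with a minimal sketch. It assumes that the word pairs are formed by aligning the words of the two compound nouns position by position (FIG. 3 is not reproduced in this text), that identical words count as synonymous, and that the synonym word dictionary 194 can be modeled as a set of unordered word pairs. The function name, the labels, and the dictionary contents are hypothetical.

```python
def build_word_pair_list(compound_a, compound_b, dictionary):
    """Split a compound-noun pair into word pairs and label each pair.

    compound_a / compound_b: lists of words, already split by morphological analysis.
    dictionary: set of frozensets, a hypothetical stand-in for synonym word dictionary 194.
    """
    # Assumption: word pairs are formed by position, so both nouns need the same length.
    if len(compound_a) != len(compound_b):
        raise ValueError("positional pairing needs the same number of words")

    word_pair_list = []
    for word_a, word_b in zip(compound_a, compound_b):
        # Assumption: identical words and dictionary entries are both "synonymous".
        known = word_a == word_b or frozenset((word_a, word_b)) in dictionary
        label = "synonymous" if known else "undetermined"
        word_pair_list.append((word_a, word_b, label))
    return word_pair_list


# Hypothetical dictionary contents, for illustration only.
synonym_word_dictionary = {frozenset(("module", "function"))}

word_pair_list = build_word_pair_list(
    ["fare", "calculation", "module"],
    ["transportation expense", "settlement", "function"],
    synonym_word_dictionary,
)
```

With this hypothetical dictionary, only the last word pair is labeled synonymous; the other two remain undetermined and would be passed on to the learning and determination steps.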

Next, the synonymous compound noun specifying unit 140 collects, from the character strings of the document set 191, word pairs of synonymous expressions (synonymous word pairs) registered in the synonym word dictionary 194 (step S72). The process of step S72 is performed by the word pair setting unit 141. Of the synonymous word pairs collected in step S72, the word pair setting unit 141 adds to the word pair list those that do not duplicate the synonymous word pairs in the pairs of compound nouns.

Next, the synonymous compound noun specifying unit 140 performs the semantic similarity learning process (step S73), taking the undetermined word pairs and synonymous word pairs identified and collected in steps S71 and S72 as the word pairs to be processed. The semantic similarity learning process of step S73 is performed by the similarity transition table creation unit 142 and the semantic similarity learning unit 143 of the synonymous compound noun specifying unit 140.

In step S73, the similarity transition table creation unit 142 generates the similarity transition table 195, which stores the learning results of the semantic similarity for each word pair to be processed. The similarity transition table creation unit 142 also stores the learning result of the semantic similarity obtained by the semantic similarity learning unit 143 for each word pair in the predetermined cell of the similarity transition table 195. When the total number of word pairs to be processed is N and the semantic similarity is learned M times, the similarity transition table creation unit 142 generates, for example, a similarity transition table 195 having N×M cells. Also in step S73, the semantic similarity learning unit 143 learns a semantic vector for each word of a word pair based on the results of the morphological analysis (cases), and calculates the semantic similarity between the words of the word pair based on the semantic vectors. The semantic similarity learning unit 143 repeats this process of calculating the semantic similarity M times for each word pair. Furthermore, each time it finishes one round of calculating the semantic similarity for a word pair, the semantic similarity learning unit 143 increases the number of cases used for the calculation. For example, the semantic similarity learning unit 143 sets the number of cases Ht in the m-th round (m = 1, 2, ..., M) for a word pair to Ht = m×H (for example, H = 1000).

After finishing the process of step S73, the synonymous compound noun specifying unit 140 performs the synonym determination process for the undetermined word pairs (step S74). The process of step S74 is performed by the word synonym determination unit 144 of the synonymous compound noun specifying unit 140. Referring to the similarity transition table 195, the word synonym determination unit 144 calculates the correlation coefficient between the transition of the semantic similarity of a synonymous word pair and the transition of the semantic similarity of an undetermined word pair, and determines that the words of the undetermined word pair are synonymous when the correlation coefficient is equal to or greater than a threshold. The word synonym determination unit 144 changes each undetermined word pair whose words it has determined to be synonymous into a synonymous word pair.

Next, the synonymous compound noun specifying unit 140 performs the synonym determination process for the pairs of compound nouns (step S75). The process of step S75 is performed by the compound noun synonym determination unit 145 of the synonymous compound noun specifying unit 140. The compound noun synonym determination unit 145 determines whether each of the plurality of word pairs in a pair of compound nouns that may be synonymous expressions is a synonymous word pair. When all the word pairs in the pair of compound nouns are synonymous word pairs, the compound noun synonym determination unit 145 determines that the pair of compound nouns is a pair of synonymous expressions and registers it in the synonym expression list 196.

When the process of step S75 is finished, the synonymous compound noun specifying unit 140 ends the synonymous compound noun specifying process (step S7).

As described above, the semantic similarity learning process (step S73) within the synonymous compound noun specifying process is performed by the similarity transition table creation unit 142 and the semantic similarity learning unit 143. As the semantic similarity learning process, these units perform, for example, the process shown in FIG. 6.

FIG. 6 is a flowchart illustrating the contents of the semantic similarity learning process.
In the semantic similarity learning process, as shown in FIG. 6, a similarity transition table 195 that stores the M learning results for each of the N word pairs is first prepared (step S7301). The process of step S7301 is performed by the similarity transition table creation unit 142. The similarity transition table creation unit 142 generates, for example, a table having N×M data storage cells.

After the process of step S7301, the similarity transition table creation unit 142 and the semantic similarity learning unit 143 repeat the first loop process (steps S7302 to S7309) M times. Each iteration of the first loop process consists of the second loop process (steps S7303 to S7308). At the loop ends of the first loop process (steps S7302 and S7309), the number m of executions of the second loop process is counted, and whether m ≥ M is determined. The counting and the determination are performed by the similarity transition table creation unit 142. When the second loop process has been repeated M times, the similarity transition table creation unit 142 ends the first loop process and ends the semantic similarity learning process.

Each iteration of the second loop process (steps S7304 to S7307) learns the semantic vector of each word in one word pair, calculates the semantic similarity, and stores the calculated semantic similarity in the similarity transition table 195. At the loop ends of the second loop process (steps S7303 and S7308), one word pair is selected from the plurality of word pairs, and whether the processes of steps S7304 to S7307 have been performed for all the word pairs is determined. The selection and the determination at the loop ends of the second loop process are performed by the similarity transition table creation unit 142. When the processes of steps S7304 to S7307 have been performed for all the word pairs, the similarity transition table creation unit 142 ends one execution of the second loop process.

In each iteration of the second loop process, H morphological analysis results are first acquired from the analysis result corpus 193 as learning cases (step S7304). The process of step S7304 is performed by the semantic similarity learning unit 143. From the morphological analysis results registered in the analysis result corpus 193, the semantic similarity learning unit 143 acquires H results that have not yet been acquired in the current semantic similarity learning process. When the second loop process is being executed for the second to M-th time, the semantic similarity learning unit 143 adds the H newly acquired learning cases to the learning cases already acquired.

Having acquired the learning cases, the semantic similarity learning unit 143 then learns the semantic vectors of the word pair based on the acquired learning cases (step S7305) and calculates the semantic similarity of the word pair based on the learned semantic vectors (step S7306). In step S7305, the semantic similarity learning unit 143 learns, according to a known learning method, a semantic vector for each word in the word pair currently being processed. In step S7306, the semantic similarity learning unit 143 calculates the semantic similarity of the word pair based on the learned semantic vector of each word. The semantic similarity learning unit 143 sends the semantic similarity calculated for the word pair currently being processed to the similarity transition table creation unit 142. On receiving the calculation result, the similarity transition table creation unit 142 stores the received semantic similarity value in the similarity transition table 195 (step S7307). In step S7307, the similarity transition table creation unit 142 stores the calculation result in the cell, among the N×M data storage cells of the similarity transition table 195, that is specified by the word pair currently being processed and the processing count m.

When the processes of steps S7304 to S7307 have been performed for all the word pairs, the similarity transition table creation unit 142 ends the second loop process. The similarity transition table creation unit 142 then updates the processing count m to m = m + 1 at the loop end of the first loop process (step S7302 or S7309) and repeats the second loop process until m ≥ M. When the processing count m reaches m ≥ M, the similarity transition table creation unit 142 ends the first loop process and ends the semantic similarity learning process.
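The two nested loops of FIG. 6 can be sketched as follows. The description leaves the learning of semantic vectors to "a known learning method", so this sketch substitutes a deliberately simple stand-in: co-occurrence counts within a fixed window, compared by cosine similarity. That choice, the window size, and all function names are assumptions made for illustration only. Following Ht = m×H, round m uses the first m×H cases of the corpus.

```python
import math
from collections import Counter


def context_vector(word, cases, window=2):
    """Stand-in semantic vector: counts of words within `window` positions of `word`."""
    counts = Counter()
    for morphemes in cases:
        for i, token in enumerate(morphemes):
            if token != word:
                continue
            lo, hi = max(0, i - window), min(len(morphemes), i + window + 1)
            counts.update(morphemes[j] for j in range(lo, hi) if j != i)
    return counts


def cosine(u, v):
    """Cosine similarity of two sparse count vectors; 0.0 if either is empty."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def build_similarity_transition_table(word_pairs, corpus, M, H):
    """Return an N x M table; cell [n][m - 1] is the similarity learned from m * H cases."""
    table = [[0.0] * M for _ in word_pairs]          # step S7301: N x M cells
    for m in range(1, M + 1):                        # first loop: learning round m
        cases = corpus[: m * H]                      # Ht = m * H cases in round m
        for n, (word_a, word_b) in enumerate(word_pairs):   # second loop: each word pair
            vec_a = context_vector(word_a, cases)    # step S7305 (stand-in method)
            vec_b = context_vector(word_b, cases)
            table[n][m - 1] = cosine(vec_a, vec_b)   # steps S7306 and S7307
    return table
```

Each row of the returned table is the similarity transition of one word pair, which is the input to the correlation-based determination of step S74. A real implementation would replace `context_vector` with whichever vector learning method is actually used.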

When the semantic similarity learning process (step S73) by the similarity transition table creation unit 142 and the semantic similarity learning unit 143 is finished, the synonymous compound noun specifying unit 140 next performs the synonym determination process for the word pairs (step S74). The process of step S74 is performed by the word synonym determination unit 144 of the synonymous compound noun specifying unit 140. As the synonym determination process of step S74, the word synonym determination unit 144 performs the process shown in FIGS. 7A and 7B.

FIG. 7A is a flowchart (part 1) illustrating the contents of the synonym determination process for the word pairs. FIG. 7B is a flowchart (part 2) illustrating the contents of the synonym determination process for the word pairs.

In the synonym determination process for the word pairs, as shown in FIG. 7A, the word synonym determination unit 144 first acquires the list of word pairs collected from the document set 191 and the similarity transition table 195 (step S7401).

Next, the word synonym determination unit 144 performs a first loop process (steps S7402 to S7411). Each iteration of the first loop process (steps S7403 to S7410) includes a second loop process (steps S7403 to S7406) and the processes of steps S7407 to S7410 performed after the second loop process. At the loop ends of the first loop process (steps S7402 and S7411), the word synonym determination unit 144 selects an undetermined word pair and determines whether the processes of steps S7403 to S7410 have been performed for all the undetermined word pairs. When they have, the word synonym determination unit 144 ends the first loop process and ends the synonym determination process for the word pairs.

Each iteration of the second loop process (steps S7404 and S7405) calculates the correlation coefficient between the transition of the semantic similarity of the undetermined word pair currently selected for processing and that of one synonymous word pair. The word synonym determination unit 144 first acquires, from the similarity transition table 195, the semantic similarities of the undetermined word pair currently being processed and the semantic similarities of the synonymous word pair (step S7404). It then calculates the correlation coefficient between the acquired semantic similarities (step S7405). In step S7405, the word synonym determination unit 144 calculates the correlation coefficient by, for example, the following formula (1).

In formula (1), x_m is the result (semantic similarity) of the m-th semantic similarity learning process for the undetermined word pair, and xa is the arithmetic mean of the results of the M learning processes for the undetermined word pair. Likewise, y_m is the result (semantic similarity) of the m-th semantic similarity learning process for the synonymous word pair, and ya is the arithmetic mean of the results of the M learning processes for the synonymous word pair.
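Formula (1) itself is not reproduced in this text. Given the symbols defined for it (the per-round values x_m and y_m and their arithmetic means xa and ya), the sketch below assumes that it is the standard Pearson correlation coefficient, that is, the sum over m of (x_m - xa)(y_m - ya) divided by the product of the square roots of the sums of (x_m - xa)^2 and (y_m - ya)^2. The function name and the handling of a constant series are assumptions.

```python
import math


def correlation_coefficient(x, y):
    """Pearson correlation of two similarity transitions of equal length M.

    x: similarities of the undetermined word pair over the M learning rounds.
    y: similarities of one synonymous word pair over the same rounds.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("both transitions need the same length M >= 2")

    xa = sum(x) / len(x)   # arithmetic mean of x_1 .. x_M
    ya = sum(y) / len(y)   # arithmetic mean of y_1 .. y_M

    covariance = sum((xm - xa) * (ym - ya) for xm, ym in zip(x, y))
    spread_x = math.sqrt(sum((xm - xa) ** 2 for xm in x))
    spread_y = math.sqrt(sum((ym - ya) ** 2 for ym in y))

    # Assumption: a constant transition has no defined correlation; report 0.0.
    if spread_x == 0 or spread_y == 0:
        return 0.0
    return covariance / (spread_x * spread_y)
```

Two transitions that rise together over the rounds give a value close to 1, and a rising transition paired with a falling one gives a value close to -1, which is what the threshold test in step S7408 relies on.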

At the loop ends of the second loop process (steps S7403 and S7406), the synonymous word pair to be processed is selected from all the synonymous word pairs, and whether the correlation coefficient has been calculated for all the synonymous word pairs is determined. When the correlation coefficient has been calculated for all the synonymous word pairs, the word synonym determination unit 144 ends the second loop process.

After ending the second loop process, the word synonym determination unit 144 calculates the average of the correlation coefficients of the semantic similarity for the undetermined word pair currently being processed (step S7407). In step S7407, the word synonym determination unit 144 averages the correlation coefficients calculated in the second loop process between that undetermined word pair and each synonymous word pair.

Next, as shown in FIG. 7B, the word synonym determination unit 144 determines whether the calculated average of the correlation coefficients is equal to or greater than a threshold (step S7408). When the average is equal to or greater than the threshold (step S7408; YES), the word synonym determination unit 144 changes the undetermined word pair currently being processed to a synonymous word pair in the word pair list (step S7409). When the average is smaller than the threshold (step S7408; NO), the word synonym determination unit 144 deletes the undetermined word pair currently being processed from the word pair list (step S7410).

After finishing the second loop process and the processes of steps S7407 to S7410 for the undetermined word pair currently being processed, the word synonym determination unit 144 determines whether there is any undetermined word pair for which these processes have not yet been performed (step S7402 or S7411). When there is such an undetermined word pair, the word synonym determination unit 144 performs the second loop process and the processes of steps S7407 to S7410 for it. When the second loop process and the processes of steps S7407 to S7410 have been performed for all the undetermined word pairs, the word synonym determination unit 144 ends the first loop process and ends the synonym determination process for the word pairs.
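The determination of FIGS. 7A and 7B can be sketched as one function. The sketch assumes a Pearson correlation for the correlation coefficient, models the word pair list as a dictionary from word pair to label, and takes the set of reference synonymous word pairs as it stands before the loop starts (the description does not say whether a pair promoted in step S7409 is used as a reference for later pairs). All names and the threshold value are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equally long series; 0.0 if either is constant."""
    xa, ya = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - xa) * (b - ya) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - xa) ** 2 for a in x))
    sy = math.sqrt(sum((b - ya) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def resolve_undetermined_pairs(word_pair_list, transitions, threshold):
    """Promote or delete each undetermined word pair (steps S7402 to S7411).

    word_pair_list: dict mapping a word pair to "synonymous" or "undetermined".
    transitions: dict mapping a word pair to its M similarities (one table row).
    """
    synonymous = [p for p, label in word_pair_list.items() if label == "synonymous"]
    undetermined = [p for p, label in word_pair_list.items() if label == "undetermined"]
    if not synonymous:
        raise ValueError("at least one synonymous word pair is needed as a reference")

    result = dict(word_pair_list)
    for pair in undetermined:                                   # first loop
        coefficients = [pearson(transitions[pair], transitions[ref])
                        for ref in synonymous]                  # second loop
        average = sum(coefficients) / len(coefficients)         # step S7407
        if average >= threshold:                                # step S7408
            result[pair] = "synonymous"                         # step S7409
        else:
            del result[pair]                                    # step S7410
    return result
```

For example, with a hypothetical threshold of 0.8, an undetermined pair whose similarity rises across the rounds in step with the reference pairs is promoted, while a pair whose similarity falls is removed from the list.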

When the word synonym determination process (step S74) by the word synonym determination unit 144 is finished, the synonymous compound noun specifying unit 140 next performs the synonym determination process for the pairs of compound nouns (step S75). The process of step S75 is performed by the compound noun synonym determination unit 145 of the synonymous compound noun specifying unit 140. As the synonym determination process of step S75, the compound noun synonym determination unit 145 performs the process shown in FIG. 8.

FIG. 8 is a flowchart illustrating the contents of the synonym determination process for the pairs of compound nouns.

In the synonym determination process for the pairs of compound nouns, the compound noun synonym determination unit 145 performs a loop process (steps S7501 to S7505). Each iteration of the loop process (steps S7502 to S7504) registers one pair of compound nouns that may be synonymous expressions in the synonym expression list 196 when all of the word pairs in that pair of compound nouns are synonymous word pairs. The compound noun synonym determination unit 145 performs the processes of steps S7502 to S7504 for every pair of compound nouns that may be synonymous expressions. At the loop ends of the loop process (steps S7501 and S7505), the compound noun synonym determination unit 145 selects a pair of compound nouns and determines whether the processes of steps S7502 to S7504 have been performed for all the pairs of compound nouns that may be synonymous expressions.

In each iteration of the loop process, the compound noun synonym determination unit 145 first refers to the result of the synonym determination process for each of the word pairs in the pair of compound nouns currently selected for processing (step S7502).

Next, the compound noun synonym determination unit 145 determines whether all of the word pairs in the pair of compound nouns are synonymous word pairs (step S7503). When all are synonymous word pairs (step S7503; YES), the compound noun synonym determination unit 145 registers the pair of compound nouns currently being processed in the synonym expression list 196 (step S7504) and then determines whether to end the loop process (step S7505). When the word pairs include a word pair that is not synonymous (step S7503; NO), the compound noun synonym determination unit 145 skips step S7504 and determines whether to end the loop process (step S7505). When the processes of steps S7502 to S7504 have been performed for all the pairs of compound nouns, the compound noun synonym determination unit 145 ends the loop process and ends the synonym determination process for the compound nouns.
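The loop of FIG. 8 reduces to an all-pairs check, sketched below. The sketch assumes that each compound noun is given as a tuple of its words, that word pairs are aligned by position, and that a word pair deleted in step S7410 is simply absent from the label dictionary. The function and variable names are hypothetical.

```python
def register_synonymous_compounds(candidate_pairs, word_pair_labels):
    """Return the compound-noun pairs whose word pairs are all synonymous.

    candidate_pairs: list of (compound_a, compound_b), each a tuple of words.
    word_pair_labels: dict mapping (word_a, word_b) to its label after step S74;
                      deleted word pairs are absent from the dictionary.
    """
    synonym_expression_list = []
    for compound_a, compound_b in candidate_pairs:           # steps S7501 to S7505
        # Assumption: positional pairing, so compounds of different length cannot match.
        if len(compound_a) != len(compound_b):
            continue
        word_pairs = list(zip(compound_a, compound_b))       # step S7502
        if all(word_pair_labels.get(pair) == "synonymous"    # step S7503
               for pair in word_pairs):
            synonym_expression_list.append((compound_a, compound_b))   # step S7504
    return synonym_expression_list
```

A single word pair that is missing or not labeled synonymous is enough to keep the whole compound-noun pair out of the list, which mirrors the NO branch of step S7503.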

As described above, in the process of determining synonymous expressions for compound nouns according to the present embodiment, each compound noun in a pair of compound nouns is divided into words to generate a plurality of word pairs, and a pair of compound nouns in which all of the word pairs are synonymous word pairs is determined to be a synonymous expression. For a pair of compound nouns that may be synonymous expressions, whether a word pair whose synonymy is undetermined is synonymous is determined based on the results of learning the semantic similarity for the undetermined word pair and for the synonymous word pairs. The semantic similarity of each word pair is learned in multiple rounds, with learning cases added in each round, and the relationship between the number of cases and the semantic similarity is calculated. When the semantic similarity of word pairs is learned in this way, the semantic similarity of a synonymous word pair increases as the number of cases increases. Moreover, the relationship between the number of cases and the semantic similarity shows the same tendency across multiple synonymous word pairs. Therefore, when the relationship between the number of cases and the semantic similarity for an undetermined word pair shows the same tendency as that for the synonymous word pairs, the words of the undetermined word pair can be determined to be synonymous. Thus, according to the present embodiment, synonymous expressions of compound nouns can be extracted accurately even when the document set 191 contains few pieces of document data or contains documents with compound nouns used only in a specific field.

FIG. 9 is a diagram showing examples of character strings and the results of their morphological analysis. FIG. 10 is a diagram showing an example of the extraction result for pairs of compound nouns and an example of the list of word pairs.

Table 401 in (a) of FIG. 9 shows examples of character strings extracted from the document set 191. In the synonym expression extraction device 1 according to the present embodiment, the morphological analysis unit 120 performs morphological analysis (step S2) on each of the character strings extracted from the document set 191. Table 402 in (b) of FIG. 9 shows the results of the morphological analysis of two of the character strings in table 401: "execute the fare calculation module to obtain transportation expenses" and "carry out the transportation expense settlement function to obtain transportation expenses". The symbol " / " in table 402 indicates a morpheme boundary. The analysis result corpus 193 accumulates the result of the morphological analysis of each sentence (each character string) in, for example, the format of table 402.

After the morphological analysis, the compound noun extraction unit 130 of the synonym expression extraction device 1 extracts compound nouns from the character strings (step S3) and then identifies pairs of compound nouns that may be synonymous expressions (step S5). When one of the conditions for a compound noun is that it be a word string in which a plurality of nouns appear consecutively, the compound noun 「運賃計算モジュール」 ("fare calculation module") is extracted from the character string 「交通費を求めるため運賃計算モジュールを実行する」. Likewise, the compound noun 「交通費精算機能」 ("transportation expense settlement function") is extracted from the character string 「交通費を求めるため交通費精算機能を実施する」. Although not illustrated, compound nouns are extracted from the other character strings under the same conditions.
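The "consecutive nouns" condition described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `extract_compound_nouns`, the simplified part-of-speech tags, and the hand-written analysis result are all assumptions, and a real morphological analyzer would produce finer-grained tags.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (surface form, simplified part-of-speech tag)


def extract_compound_nouns(tokens: List[Token], min_len: int = 2) -> List[List[str]]:
    """Return every maximal run of at least `min_len` consecutive nouns."""
    compounds: List[List[str]] = []
    run: List[str] = []
    # A sentinel non-noun token flushes a run that ends the sentence.
    for surface, pos in tokens + [("", "EOS")]:
        if pos == "noun":
            run.append(surface)
            continue
        if len(run) >= min_len:
            compounds.append(run)
        run = []
    return compounds


# Hypothetical analysis result; the tags are illustrative, not real analyzer output.
sentence: List[Token] = [
    ("交通費", "noun"), ("を", "particle"), ("求める", "verb"), ("ため", "other"),
    ("運賃", "noun"), ("計算", "noun"), ("モジュール", "noun"),
    ("を", "particle"), ("実行", "noun"), ("する", "verb"),
]
compounds = extract_compound_nouns(sentence)
```

With these tags, single nouns such as 交通費 and 実行 are skipped because they do not form a run of two or more nouns, and only the three-word run is kept.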

After extracting the compound nouns, the compound noun extraction unit 130 identifies pairs of compound nouns that may be synonymous expressions, based on the context similarity of the compound nouns and the like. Both of the character strings in the example above have the form 「交通費を求めるため（複合名詞）を・・・」 ("... (compound noun) ... to obtain the transportation expenses"), so their context similarity is high. The compound noun extraction unit 130 therefore identifies "fare calculation module" and "transportation expense settlement function" as a pair of compound nouns that may be synonymous expressions, as shown in table 403 in (a) of FIG. 10.
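One way to picture the context comparison is to replace each compound noun with a slot marker and compare what remains of the two sentences. The sketch below does this with Jaccard similarity over token sets. The passage does not say which context similarity measure is used, so Jaccard, the `<CN>` marker, the 0.6 threshold, and every function name here are assumptions chosen only for illustration.

```python
from itertools import combinations
from typing import List, Tuple


def context_of(tokens: List[str], compound: List[str]) -> List[str]:
    """Sentence tokens with the first occurrence of the compound replaced by a slot."""
    n = len(compound)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == compound:
            return tokens[:i] + ["<CN>"] + tokens[i + n:]
    raise ValueError("compound noun not found in sentence")


def jaccard(a: List[str], b: List[str]) -> float:
    """Jaccard similarity of two token lists, treated as sets."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)


def candidate_pairs(occurrences: List[Tuple[List[str], List[str]]],
                    threshold: float = 0.6) -> List[Tuple[str, str]]:
    """Pairs of distinct compound nouns whose sentence contexts are similar enough."""
    pairs = []
    for (tok_a, cn_a), (tok_b, cn_b) in combinations(occurrences, 2):
        if cn_a == cn_b:
            continue
        if jaccard(context_of(tok_a, cn_a), context_of(tok_b, cn_b)) >= threshold:
            pairs.append(("".join(cn_a), "".join(cn_b)))
    return pairs


s1 = ["交通費", "を", "求める", "ため", "運賃", "計算", "モジュール", "を", "実行", "する"]
s2 = ["交通費", "を", "求める", "ため", "交通費", "精算", "機能", "を", "実施", "する"]
cn1 = ["運賃", "計算", "モジュール"]
cn2 = ["交通費", "精算", "機能"]
pairs = candidate_pairs([(s1, cn1), (s2, cn2)])
```

In this example the two contexts differ only in 実行 versus 実施, so the pair passes the assumed threshold and becomes a candidate for the word-level check that follows.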

Next, in the synonym expression extraction device 1, the word pair setting unit 141 identifies each of the word pairs in the pair of compound nouns as either a synonymous word pair or a determination target word pair, and collects synonymous word pairs from the document set 191 (steps S71 and S72).

As described above, 「運賃計算モジュール」 ("fare calculation module") and 「交通費精算機能」 ("transportation expense settlement function") are each compound nouns formed by combining three words (nouns). As shown in the word pair list 404 in (b) of FIG. 10, this pair of compound nouns therefore yields the word pair 「運賃」 ("fare") and 「交通費」 ("transportation expense"), the word pair 「計算」 ("calculation") and 「精算」 ("settlement"), and the word pair 「モジュール」 ("module") and 「機能」 ("function"). The word pair setting unit 141 refers to the synonym word dictionary 194 and identifies whether each of these three word pairs is a synonymous word pair or a determination target word pair. With the synonym word dictionary 194 of FIG. 2, the pairs "fare"/"transportation expense" and "calculation"/"settlement" are synonymous word pairs, whereas the pair "module"/"function" is an undetermined word pair. The word pair setting unit 141 therefore registers the three word pairs in the word pair list 404 together with the attribute of each pair (undetermined or synonymous). When a word pair is registered in the word pair list 404, it is given an array ID that identifies it, as shown in (b) of FIG. 10.
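The pairing and lookup described above might look like the following sketch. It assumes that words are paired by position, that the two compound nouns have the same number of words, and that undetermined pairs receive the lowest array IDs; none of these details is stated in the passage. `WordPairEntry` and `build_word_pair_list` are hypothetical names.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Sequence, Set, Tuple


@dataclass
class WordPairEntry:
    array_id: int
    words: Tuple[str, str]
    attribute: str  # "synonym" or "undetermined"


def build_word_pair_list(cn_a: Sequence[str], cn_b: Sequence[str],
                         synonym_dict: Set[FrozenSet[str]]) -> List[WordPairEntry]:
    """Pair words by position and label each pair using the synonym dictionary."""
    if len(cn_a) != len(cn_b):
        raise ValueError("this sketch pairs words by position and needs equal lengths")
    classified = []
    for w_a, w_b in zip(cn_a, cn_b):
        known = frozenset((w_a, w_b)) in synonym_dict
        classified.append(((w_a, w_b), "synonym" if known else "undetermined"))
    # Assumed ordering: undetermined pairs first (the sort is stable).
    classified.sort(key=lambda item: item[1] != "undetermined")
    return [WordPairEntry(i, words, attr)
            for i, (words, attr) in enumerate(classified, start=1)]


# Stand-in for the synonym word dictionary 194 (contents taken from the example).
synonym_dict = {
    frozenset(("運賃", "交通費")),
    frozenset(("計算", "精算")),
    frozenset(("実行", "実施")),
}
entries = build_word_pair_list(["運賃", "計算", "モジュール"],
                               ["交通費", "精算", "機能"], synonym_dict)
```

Storing each dictionary entry as a `frozenset` makes the lookup independent of word order, so ("運賃", "交通費") and ("交通費", "運賃") match the same entry.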

Furthermore, the word pair setting unit 141 collects, from the morphological analysis results for all character strings, the synonymous word pairs registered in the synonym word dictionary 194 and registers them in the word pair list 404. In the synonym word dictionary 194 of FIG. 2, the pair 「実行」 ("execution") and 「実施」 ("implementation") is registered as a synonymous word pair. The morphological analysis results in table 402 in (b) of FIG. 9 include a character string containing "execution" and a character string containing "implementation". The word pair setting unit 141 therefore registers the pair "execution"/"implementation" as a synonymous word pair, as in the word pair list 404 in (b) of FIG. 10. In the same way, the word pair setting unit 141 collects every synonymous word pair registered in the synonym word dictionary 194 that appears in the morphological analysis results for all character strings and registers it in the word pair list 404.
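The collection step can be sketched as a vocabulary check: a dictionary pair is kept when both of its words occur somewhere in the analyzed corpus. The function name `collect_synonym_pairs`, the data layout, and the extra dictionary entry 購入/購買 are assumptions made for this illustration.

```python
from typing import FrozenSet, Iterable, List, Set


def collect_synonym_pairs(analyzed: Iterable[List[str]],
                          synonym_dict: Set[FrozenSet[str]]) -> Set[FrozenSet[str]]:
    """Return the dictionary pairs whose two words both appear in the corpus."""
    vocabulary = {token for sentence in analyzed for token in sentence}
    return {pair for pair in synonym_dict if pair <= vocabulary}


# Stand-in for the analysis result corpus 193: two tokenized sentences.
corpus = [
    ["交通費", "を", "求める", "ため", "運賃", "計算", "モジュール", "を", "実行", "する"],
    ["交通費", "を", "求める", "ため", "交通費", "精算", "機能", "を", "実施", "する"],
]
# Stand-in for the synonym word dictionary 194; 購入/購買 is a made-up extra entry.
dictionary = {
    frozenset(("実行", "実施")),
    frozenset(("運賃", "交通費")),
    frozenset(("購入", "購買")),
}
found = collect_synonym_pairs(corpus, dictionary)
```

The pair 購入/購買 is dropped because neither word occurs in the corpus, which mirrors the idea that only synonymous word pairs actually present in the document set take part in learning.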

After the undetermined word pairs in the pair of compound nouns and the synonymous word pairs registered in the synonym word dictionary 194 have been registered in the word pair list 404, the synonym expression extraction device 1 performs the semantic similarity learning process (step S73). In this process, the similarity transition table creation unit 142 and the semantic similarity learning unit 143 perform steps S7301 to S7309 of FIG. 6 and create a similarity transition table 195 such as the one shown in FIG. 11.

FIG. 11 is a diagram showing an example of the similarity transition table.
As shown in FIG. 11, the similarity transition table 195 has data storage fields (cells) that store the results of M learning processes for each of N word pairs. The result of each of the M learning processes for a word pair is stored in the field specified by the array ID assigned to the word pair and the learning count m. In the similarity transition table 195 of FIG. 11, the number of cases H acquired in one semantic similarity learning process is H = 1000. That is, the result of the m-th learning process in the similarity transition table 195 of FIG. 11 is the result of learning performed with m × 1000 learning cases (m = 1, 2, ..., M).
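The table layout above (N rows indexed by array ID, M columns indexed by learning count, with m × H cases behind column m) can be sketched as below. The learner here is a deliberately simple stand-in: it builds co-occurrence count vectors and compares them with cosine similarity. The passage does not describe the semantic vector learner at this point, so that choice, the window size, and all function names are assumptions.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def context_vector(word: str, cases: List[List[str]], window: int = 2) -> Counter:
    """Count the tokens seen within `window` positions of `word` (stand-in learner)."""
    vec: Counter = Counter()
    for tokens in cases:
        for i, tok in enumerate(tokens):
            if tok == word:
                lo, hi = max(0, i - window), i + window + 1
                vec.update(t for j, t in enumerate(tokens[lo:hi], start=lo) if j != i)
    return vec


def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity of two sparse count vectors; 0.0 if either is empty."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0


def build_transition_table(word_pairs: Dict[int, Tuple[str, str]],
                           cases: List[List[str]], H: int, M: int) -> Dict[int, List[float]]:
    """Row per array ID; column m holds the similarity learned from the first m*H cases."""
    table: Dict[int, List[float]] = {}
    for array_id, (w_a, w_b) in word_pairs.items():
        row = []
        for m in range(1, M + 1):
            subset = cases[: m * H]  # the m-th run uses m*H cases in total
            row.append(cosine(context_vector(w_a, subset), context_vector(w_b, subset)))
        table[array_id] = row
    return table


cases = [
    ["交通費", "を", "求める", "ため", "運賃", "計算", "モジュール", "を", "実行", "する"],
    ["交通費", "を", "求める", "ため", "交通費", "精算", "機能", "を", "実施", "する"],
]
table = build_transition_table({1: ("モジュール", "機能")}, cases, H=1, M=2)
```

Here H = 1 and M = 2 keep the example tiny; the text uses H = 1000. Each row of `table` is the series of similarities that the later correlation step compares across word pairs.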

After the similarity transition table 195 has been created, the word synonym determination unit 144 of the synonym expression extraction device 1 performs the synonym determination process for word pairs (step S74). Step S74 determines whether an undetermined word pair is a synonymous word pair. The word synonym determination unit 144 performs steps S7401 to S7411 of FIGS. 7A and 7B to make this determination.

FIG. 12 is a diagram illustrating the method of determining whether an undetermined word pair is a synonymous expression.
(a) of FIG. 12 shows three straight lines representing the relationship between the number of cases and the semantic similarity for synonymous word pairs in the similarity transition table 195, and two straight lines representing that relationship for undetermined word pairs. A synonymous word pair is a pair of words defined as synonymous in the synonym word dictionary 194. Each word in a synonymous word pair appears more frequently in the document set 191 than a compound noun formed by combining a plurality of words. For this reason, the semantic similarity of a synonymous word pair, calculated from the learned semantic vector of each word in the pair, tends to rise in proportion to the number of cases (the amount of learning data), as shown in (a) of FIG. 12. In other words, when the two words of a word pair are synonymous, the semantic similarity of that pair tends to rise in proportion to the number of cases (the amount of learning data).

Accordingly, when the two words of an undetermined word pair are synonymous, the relationship between the number of cases and the semantic similarity for that pair is expected to show the same tendency as for synonymous word pairs. That is, the semantic similarity of the pair is expected to rise in proportion to the number of cases (the amount of learning data), as shown by the thick solid line in (a) of FIG. 12. Conversely, when the two words of an undetermined word pair are not synonymous, the relationship between the number of cases and the semantic similarity for that pair tends to differ from that of synonymous word pairs, as shown by the thick dotted line in (a) of FIG. 12.

In the present embodiment, therefore, processing such as steps S7403 to S7407 of FIG. 7A is performed to calculate the correlation coefficient between the transition of the semantic similarity of an undetermined word pair and that of a synonymous word pair. For example, in the similarity transition table 195 of FIG. 11, the word pair with array ID "1" is an undetermined word pair and the word pairs with the other array IDs are synonymous word pairs (see (b) of FIG. 10). The word synonym determination unit 144 therefore calculates the correlation coefficient of the semantic similarity for each pair of array IDs in which one array ID is "1", as in table 405 in (b) of FIG. 12. The word synonym determination unit 144 then calculates the average of these correlation coefficients and, if the average is equal to or greater than a threshold, determines that the word pair with array ID "1" is a synonymous word pair. In the example shown in (b) of FIG. 12, the average correlation coefficient is 0.68. Therefore, when the determination threshold used in step S7408 of FIG. 7B is 0.68 or less, the word pair with array ID "1" (the pair "module"/"function") is determined to be synonymous. In this case, the word synonym determination unit 144 changes, for example, the attribute of the "module"/"function" word pair in the word pair list 404 in (b) of FIG. 10 from undetermined to synonymous (step S7409). On the other hand, when the determination threshold used in step S7408 of FIG. 7B is greater than 0.68, the word pair with array ID "1" is determined not to be synonymous. In this case, the word synonym determination unit 144 deletes the "module"/"function" word pair from the word pair list 404 (step S7410).
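The decision rule above (correlate the undetermined pair's similarity series with each synonymous pair's series, average the coefficients, compare with the threshold) can be sketched as follows. Pearson's coefficient is assumed as the correlation measure, since the passage only says "correlation coefficient". The table values are invented for illustration and are not the values of FIG. 11 or FIG. 12, and the function names are hypothetical.

```python
import math
from statistics import mean
from typing import Dict, List, Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length series."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # assumption: a constant series is treated as uncorrelated
    return cov / (sx * sy)


def judge_undetermined(table: Dict[int, List[float]], undetermined_id: int,
                       synonym_ids: Sequence[int], threshold: float) -> Tuple[bool, float]:
    """Average the correlations against every synonymous pair and apply the threshold."""
    target = table[undetermined_id]
    average = mean(pearson(target, table[i]) for i in synonym_ids)
    return average >= threshold, average


# Invented similarity series (one row per array ID, one column per learning run).
table = {
    1: [0.1, 0.2, 0.3, 0.4],  # undetermined pair that rises with more cases
    2: [0.2, 0.4, 0.6, 0.8],  # synonymous pair
    3: [0.5, 0.5, 0.6, 0.9],  # synonymous pair
    4: [0.4, 0.3, 0.2, 0.1],  # undetermined pair that falls with more cases
}
rising_is_synonym, rising_avg = judge_undetermined(table, 1, [2, 3], threshold=0.68)
falling_is_synonym, falling_avg = judge_undetermined(table, 4, [2, 3], threshold=0.68)
```

The rising series correlates strongly with both synonymous rows and passes the threshold, while the falling series yields a negative average and is rejected, which corresponds to changing the attribute (step S7409) or deleting the pair (step S7410).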

After step S74, the compound noun synonym determination unit 145 of the synonym expression extraction device 1 determines whether each pair of compound nouns that may be synonymous expressions is in fact a synonymous expression. As described above, such a pair of compound nouns includes an undetermined word pair among its word pairs. Whether each undetermined word pair is a synonymous word pair has already been decided by the synonym determination process of step S74. If an undetermined word pair is a synonymous word pair, its attribute in the word pair list 404 has been changed from undetermined to synonymous. If it is not, it has been deleted from the word pair list 404. The synonymous word pairs in the pair of compound nouns are also registered in the word pair list 404. Therefore, when all of the word pairs in a pair of compound nouns are registered in the word pair list 404 as synonymous word pairs, the compound noun synonym determination unit 145 determines that the pair of compound nouns is a synonymous expression and registers it in the synonymous expression list 196.
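The final check reduces to "every positional word pair must be present in the word pair list with the synonym attribute". A minimal sketch, assuming the list is held as a dictionary keyed by unordered word pairs and that identical words count as trivially synonymous (both are assumptions, as is the function name):

```python
from typing import Dict, FrozenSet, Sequence


def is_synonymous_expression(cn_a: Sequence[str], cn_b: Sequence[str],
                             word_pair_list: Dict[FrozenSet[str], str]) -> bool:
    """True when every positional word pair is registered as a synonymous word pair."""
    if len(cn_a) != len(cn_b):
        return False  # this sketch only handles compounds with equal word counts
    return all(
        a == b or word_pair_list.get(frozenset((a, b))) == "synonym"
        for a, b in zip(cn_a, cn_b)
    )


cn_a = ["運賃", "計算", "モジュール"]
cn_b = ["交通費", "精算", "機能"]

# State of the list if モジュール/機能 was promoted to a synonymous pair (step S7409).
promoted = {
    frozenset(("運賃", "交通費")): "synonym",
    frozenset(("計算", "精算")): "synonym",
    frozenset(("モジュール", "機能")): "synonym",
}
# State of the list if モジュール/機能 was deleted (step S7410).
deleted = {
    frozenset(("運賃", "交通費")): "synonym",
    frozenset(("計算", "精算")): "synonym",
}
accepted = is_synonymous_expression(cn_a, cn_b, promoted)
rejected = is_synonymous_expression(cn_a, cn_b, deleted)
```

Because a deleted pair is simply absent from the dictionary, the same lookup covers both outcomes of step S74.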

In this way, the synonym expression extraction device 1 according to the present embodiment learns the semantic similarity multiple times, adding cases (data) used for learning each time, both for word pairs in a pair of compound nouns extracted from the document set 191 whose synonymy is undetermined and for synonymous word pairs. When the average of the correlation coefficients between the learning-count-versus-semantic-similarity relationship of the synonymous word pairs and that of an undetermined word pair is equal to or greater than the threshold, the undetermined word pair is determined to be a synonymous word pair. The present embodiment then determines whether a pair of compound nouns is a synonymous expression according to whether all of the word pairs in it are synonymous word pairs. The present embodiment can thus improve the accuracy of determining whether a pair of compound nouns that appears infrequently in the document set 191 is a synonymous expression.

The number M of semantic similarity learning processes performed for each word pair in the present embodiment can be set as appropriate, but is preferably 20 or more (M ≥ 20).

FIG. 13 is a graph illustrating how the number of learning processes is determined.
In the graph of FIG. 13, the horizontal axis is the number of times the semantic similarity learning process is executed. The left vertical axis is the significance test value for the correlation coefficient of the semantic similarity, and the right vertical axis is the processing time required for the learning process.

As shown by the thick solid line in FIG. 13, the significance test value for the correlation coefficient of the semantic similarity decreases as the number of executions of the learning process for each word pair increases. The significance test value is the result of testing whether the correlation coefficient is statistically significant, and is calculated by a known test method. It is related to the reliability of the correlation coefficient of the semantic similarity: the smaller the significance test value, the higher that reliability. That is, as FIG. 13 shows, increasing the number of executions of the learning process for each word pair raises the reliability of the correlation coefficient of the semantic similarity between two word pairs. However, increasing the number of executions also increases the processing time required for the learning process, as shown by the thick dotted line in FIG. 13. The number of executions M of the learning process for each word pair should therefore be set as appropriate based on the desired reliability of the correlation coefficient of the semantic similarity (the accuracy of determining whether a word pair is a synonymous expression), the processing capability of the synonym expression extraction device 1, the desired response time, and so on. For example, if a significance test value (at a confidence level of 99% or more) of 0.5 or less is desired, M may be set to about 30. If a significance test value (at a confidence level of 99% or more) of 0.6 or less is acceptable, M can be reduced to about 20.
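The text only says that the significance test value comes from "a known test method". One reading that reproduces the quoted figures, offered here purely as an assumption, is that the value is the critical Pearson correlation coefficient from the standard two-sided t-test at the 1% level, with the number of points n in each series equal to M:

```latex
t = \frac{r\sqrt{n-2}}{\sqrt{1-r^{2}}}
\quad\Longleftrightarrow\quad
r_{\mathrm{crit}}(n) = \frac{t_{\mathrm{crit}}}{\sqrt{t_{\mathrm{crit}}^{2} + n - 2}},
\qquad
\begin{aligned}
n = 20:&\quad t_{\mathrm{crit}} \approx 2.878 \;\Rightarrow\; r_{\mathrm{crit}} \approx 0.561 \\
n = 30:&\quad t_{\mathrm{crit}} \approx 2.763 \;\Rightarrow\; r_{\mathrm{crit}} \approx 0.463
\end{aligned}
```

Under this assumption, a correlation coefficient must exceed roughly 0.56 to be significant with 20 learning runs and roughly 0.46 with 30 runs, which is consistent with the "0.6 or less at about 20" and "0.5 or less at about 30" figures in the text.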

The flowcharts of FIGS. 4 to 8 are merely one example of the synonymous expression extraction process according to the present embodiment. The process is not limited to these flowcharts, and its content can be changed as appropriate without departing from the gist of the present embodiment. For example, in the semantic similarity learning process (step S73), instead of generating the similarity transition table 195 of FIG. 11, N arrays may be generated, each storing in order the results of the M learning processes for one of the N word pairs. Also, for example, in the synonym determination process for word pairs (step S74), the semantic similarity of the undetermined word pair currently being processed may be acquired before the second loop in FIG. 7A starts. In that case, step S7404 in FIG. 7A only needs to acquire the semantic similarity of the synonymous word pair currently being processed. This avoids the increase in processing time caused by acquiring the semantic similarity of the undetermined word pair on every iteration of the second loop.

The synonym expression extraction device 1 according to the present embodiment is not limited to the configuration shown in FIG. 1 and may, for example, include an acquisition unit that acquires the document set 191 from an external device. The synonym expression extraction device 1 may also include a display unit that displays the result of the synonymous compound noun specifying process (step S7), the synonymous expression list 196, and the like, or an output unit that outputs them to an external device.

[Second Embodiment]
FIG. 14 is a diagram showing the functional configuration of the synonym expression extraction device according to the second embodiment.

As shown in FIG. 14, the synonym expression extraction device 1 according to the present embodiment includes a character string extraction unit 110, a morpheme analysis unit 120, a compound noun extraction unit 130, and a synonymous compound noun specifying unit 140. The synonym expression extraction device 1 also includes a storage unit (not shown) that stores the document set 191, the analysis dictionary 192, the analysis result corpus 193, the synonym word dictionary 194, the similarity transition table 195, and the synonymous expression list 196.

The character string extraction unit 110, the morpheme analysis unit 120, the compound noun extraction unit 130, and the synonymous compound noun specifying unit 140 of the synonym expression extraction device 1 of the present embodiment each have the functions described in the first embodiment. In addition to those functions, the synonymous compound noun specifying unit 140 of the present embodiment has a function of setting the determination threshold used to determine whether an undetermined word pair is a synonymous pair.

The synonymous compound noun specifying unit 140 of the synonym expression extraction device 1 of the present embodiment includes a word pair setting unit 141, a similarity transition table creation unit 142, a semantic similarity learning unit 143, a word synonym determination unit 144, and a compound noun synonym determination unit 145. Each of these units has the functions described in the first embodiment.

The synonymous compound noun specifying unit 140 according to the present embodiment further includes a determination threshold setting unit 146. The determination threshold setting unit 146 sets the determination threshold used to determine whether an undetermined word pair is a synonymous pair. It sets the threshold based on the semantic similarity learning results for each of the synonymous word pairs that the word pair setting unit 141 collected from the document set. That is, the determination threshold setting unit 146 sets the determination threshold based on the learning results for the synonymous word pairs in the similarity transition table 195 created in the semantic similarity learning process (step S73) described in the first embodiment.

The synonym expression extraction device 1 according to the present embodiment extracts synonymous expressions of compound nouns from the document set 191 by, for example, the processing of steps S1 to S7 shown in FIG. 4. Of these, steps S1 to S6 may each have the processing content described in the first embodiment. In contrast, within the synonymous compound noun specifying process of step S7, the synonym expression extraction device 1 according to the present embodiment uses the results of the semantic similarity learning process to set the determination threshold used to determine whether an undetermined word pair is a synonymous pair. The synonym expression extraction device 1 according to the present embodiment therefore performs, for example, the process shown in FIG. 15 as the synonymous compound noun specifying process of step S7.

FIG. 15 is a flowchart illustrating the synonymous compound noun specifying process according to the second embodiment.

In the present embodiment as well, the synonymous compound noun specifying process is performed by the synonymous compound noun specifying unit 140. As shown in FIG. 15, the synonymous compound noun specifying unit 140 first identifies the synonymous word pairs and undetermined word pairs in the pair of compound nouns (step S71) and collects from the document set the synonymous word pairs registered in the synonym word dictionary (step S72). Steps S71 and S72 are performed by the word pair setting unit 141, which carries out the processing described in the first embodiment and generates, for example, the word pair list 404 shown in (b) of FIG. 10.

Next, the synonymous compound noun specifying unit 140 performs the semantic similarity learning process (step S73), using the undetermined word pairs and synonymous word pairs obtained in steps S71 and S72 as the word pairs to be processed. Step S73 is performed by the similarity transition table creation unit 142 and the semantic similarity learning unit 143 of the synonymous compound noun specifying unit 140. For example, these units perform the processing described in the first embodiment (see FIG. 6) and create the similarity transition table 195 (see FIG. 11), which shows the transition of the semantic similarity learning results for each of the word pairs. That is, in the present embodiment as well, the semantic similarity learning unit 143 executes the process of learning the semantic similarity of a word pair multiple times, adding cases (morphological analysis results) used for learning at each execution. For example, the semantic similarity learning unit 143 sets the number of cases Ht used in the m-th process (m = 1, 2, ..., M) for a word pair to Ht = m × H (for example, H = 1000).

After the semantic similarity learning process of step S73, the synonymous compound noun specifying unit 140 performs the process of setting the determination threshold used for the synonym determination of word pairs (step S76). Step S76 is performed by the determination threshold setting unit 146, which sets the determination threshold based on the semantic similarity learning results for the synonymous word pairs in the similarity transition table 195 generated in step S73. Specifically, the determination threshold setting unit 146 calculates the correlation coefficients between the semantic similarity learning results of the synonymous word pairs in the similarity transition table 195 and sets the average of those correlation coefficients as the determination threshold.
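The threshold rule of step S76 can be sketched by correlating every pair of synonymous rows in the table and averaging the results. As before, Pearson's coefficient is an assumption, the similarity values are invented, and `determination_threshold` is a hypothetical name.

```python
import math
from itertools import combinations
from statistics import mean
from typing import Dict, List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length series."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # assumption: a constant series is treated as uncorrelated
    return cov / (sx * sy)


def determination_threshold(table: Dict[int, List[float]],
                            synonym_ids: Sequence[int]) -> float:
    """Mean correlation over all pairs of synonymous word-pair rows."""
    if len(synonym_ids) < 2:
        raise ValueError("at least two synonymous word pairs are needed")
    return mean(pearson(table[a], table[b]) for a, b in combinations(synonym_ids, 2))


# Invented similarity series for three synonymous word pairs.
table = {
    2: [0.2, 0.4, 0.6, 0.8],
    3: [0.5, 0.5, 0.6, 0.9],
    4: [0.3, 0.5, 0.6, 0.7],
}
threshold = determination_threshold(table, [2, 3, 4])
```

Deriving the threshold from the known synonymous pairs ties it to how consistently those pairs behave in the given document set, instead of relying on a fixed constant.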

After step S76, the synonymous compound noun specifying unit 140 performs the synonym determination process for the undetermined word pairs (step S74). Step S74 is performed by the word synonym determination unit 144 of the synonymous compound noun specifying unit 140. The word synonym determination unit 144 refers to the similarity transition table 195, calculates the correlation coefficient between the transition of the semantic similarity of a synonymous word pair and that of an undetermined word pair, and determines that the words of the undetermined word pair are synonymous when the correlation coefficient is equal to or greater than the threshold. The word synonym determination unit 144 changes an undetermined word pair whose words it has determined to be synonymous into a synonymous word pair. In step S74 of the present embodiment, the word synonym determination unit 144 makes this determination based on the determination threshold set in step S76. That is, it uses the determination threshold set in step S76 for the determination in step S7408 of FIG. 7B.

As described above, in the synonymous compound noun specifying process (step S7) according to the present embodiment, the determination threshold used to determine whether a word pair is synonymous is set based on the semantic similarity learning results for a plurality of word pairs. The determination threshold setting process (step S76) is performed by the determination threshold setting unit 146, which performs, for example, the processing shown in FIG. 16.

FIG. 16 is a flowchart illustrating the determination threshold setting process.
The determination threshold setting unit 146 extracts the semantic similarity learning results of the synonymous word pairs from the similarity transition table 195 (step S7601). In step S7601, the determination threshold setting unit 146 acquires, for example, the array IDs of the word pairs whose attribute is "synonymous" from the word pair list 404 in (b) of FIG. 10, and extracts the semantic similarity learning results for those array IDs from the similarity transition table 195.

Next, the determination threshold setting unit 146 performs a loop process (steps S7602 to S7604) that calculates the correlation coefficients between the semantic similarities of the synonymous word pairs. One iteration of the loop calculates the correlation coefficient between the semantic similarities of two synonymous word pairs (step S7603). At the loop ends (steps S7602 and S7604), the determination threshold setting unit 146 selects the two synonymous word pairs for which the correlation coefficient is to be calculated, and determines whether the correlation coefficient has been calculated for every combination of synonymous word pairs. When the correlation coefficient has been calculated for every combination of two synonymous word pairs drawn from the synonymous word pairs being processed, the determination threshold setting unit 146 ends the loop process.

After the loop process (steps S7602 to S7604) that calculates the correlation coefficient of the semantic similarity for each two synonymous word pairs, the determination threshold setting unit 146 calculates the average of the calculated correlation coefficients (step S7605). In step S7605, the average is taken over all of the correlation coefficients calculated in the loop process.

The determination threshold setting unit 146 then sets the calculated average of the correlation coefficients as the determination threshold for the synonym determination of word pairs (step S7606), and ends the determination threshold setting process.
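The flow of steps S7601 to S7606 can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the correlation coefficient is the Pearson coefficient (formula (1) is defined elsewhere in the specification), and the table layout, function names, and sample values are all hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length similarity transitions (assumed form)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant transition")
    return cov / sqrt(var_x * var_y)


def set_determination_threshold(transition_table, attributes):
    """Average the pairwise correlations of all synonymous word pairs."""
    # S7601: pick the array IDs whose attribute is "synonymous".
    synonym_ids = [i for i, attr in attributes.items() if attr == "synonymous"]
    if len(synonym_ids) < 2:
        raise ValueError("at least two synonymous word pairs are required")
    # S7602-S7604: one correlation coefficient per combination of two array IDs.
    coefficients = [
        pearson(transition_table[a], transition_table[b])
        for a, b in combinations(synonym_ids, 2)
    ]
    # S7605-S7606: the mean of the coefficients becomes the threshold.
    return sum(coefficients) / len(coefficients)


# Hypothetical word pair attributes and similarity transitions keyed by array ID.
attributes = {1: "undetermined", 2: "synonymous", 3: "synonymous", 4: "synonymous"}
transition_table = {
    1: [0.10, 0.12, 0.11],
    2: [0.1, 0.2, 0.3],
    3: [0.2, 0.4, 0.6],
    4: [0.1, 0.3, 0.2],
}
threshold = set_determination_threshold(transition_table, attributes)
```

Using `itertools.combinations` visits each unordered pair of array IDs once, which matches the description of covering every combination of two synonymous word pairs.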

After the determination threshold setting process (step S76), the synonymous compound noun specifying unit 140 performs the synonym determination process for word pairs (step S74) as described above, and determines whether each undetermined word pair is a synonymous word pair. Step S74 is performed by the word synonym determination unit 144, which performs, for example, steps S7401 to S7411 shown in FIGS. 7A and 7B. In doing so, the word synonym determination unit 144 uses the determination threshold set in step S76 as the threshold for the determination in step S7408.
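The threshold comparison used in step S74 can be sketched roughly as below. The details of steps S7401 to S7411 are given in the first embodiment and are not reproduced here, so this sketch makes simplifying assumptions: it compares each undetermined word pair against a single reference synonymous word pair, uses the Pearson coefficient, and all names and data are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length similarity transitions (assumed form)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def promote_synonyms(transition_table, attributes, reference_id, threshold):
    """Return a copy of attributes with qualifying undetermined pairs marked synonymous."""
    reference = transition_table[reference_id]
    updated = dict(attributes)
    for pair_id, attr in attributes.items():
        if attr != "undetermined":
            continue
        # Counterpart of the S7408 decision: correlation >= threshold means synonymous.
        if pearson(reference, transition_table[pair_id]) >= threshold:
            updated[pair_id] = "synonymous"
    return updated


# Hypothetical transitions; array ID 2 is a known synonymous word pair.
table = {1: [0.1, 0.2, 0.3], 2: [0.2, 0.4, 0.6], 5: [0.3, 0.2, 0.1]}
attrs = {1: "undetermined", 2: "synonymous", 5: "undetermined"}
result = promote_synonyms(table, attrs, reference_id=2, threshold=0.6)
```

Returning a new dictionary rather than mutating `attributes` keeps the original word pair list available for inspection; an implementation could equally update the list in place.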

FIG. 17 is a diagram illustrating a specific example of the method of setting the determination threshold.
(a) of FIG. 17 shows the word pair list 404 exemplified in the first embodiment. As described above, the word pair list 404 is created by the word pair creation unit 141 through steps S71 and S72 of the synonymous compound noun specifying process (step S7). The word pair list 404 contains the word pairs extracted from pairs of compound nouns and the synonymous word pairs collected from the document set 191. Each word pair in the word pair list 404 is given an attribute of "undetermined" or "synonymous", which indicates whether the pair is known to be synonymous, and an array ID that identifies the word pair.

In the determination threshold setting process according to the present embodiment, the determination threshold setting unit 146 sets the threshold used to determine whether an undetermined word pair is a synonymous word pair, based on the correlation coefficients between the semantic similarity learning results (transitions) of the synonymous word pairs. To this end, the determination threshold setting unit 146 calculates the correlation coefficient of the semantic similarity for every combination of two array IDs (array ID pair) chosen from the array IDs of the synonymous word pairs. For example, in the word pair list 404 in (a) of FIG. 17, the word pair with array ID 1 has the attribute "undetermined", and the word pairs with array IDs 2 to 4 and N have the attribute "synonymous". The determination threshold setting unit 146 therefore generates the array ID pairs {2, 3}, {2, 4}, and so on from the array IDs assigned to the synonymous word pairs, as shown in the table 406 in (b) of FIG. 17, and calculates the correlation coefficient of the semantic similarity for each array ID pair (steps S7602 to S7604). The correlation coefficient is calculated by formula (1) above. Once the correlation coefficient has been calculated for every array ID pair, the determination threshold setting unit 146 calculates the average of the calculated correlation coefficients (step S7605).

As described above, in the present embodiment, the determination threshold used to determine whether an undetermined word pair extracted from a pair of compound nouns is synonymous is set automatically, based on the correlation coefficients of the semantic similarities of the synonymous word pairs collected from the document set 191. As shown in (a) of FIG. 12, the relationship between the number of cases and the semantic similarity in the learning results shows the same tendency for every synonymous word pair, namely that the semantic similarity rises as the number of cases increases, but the amount of change (the slope of the line) differs between pairs. Setting the determination threshold from the correlation coefficients of the semantic similarities of the synonymous word pairs in the document set 191, which contains the documents from which synonymous compound nouns are to be extracted, therefore makes it possible to judge whether a word pair is synonymous with a threshold suited to the content of those documents. The present embodiment can thus improve the accuracy of determining whether an undetermined word pair is synonymous.

The flowcharts of FIGS. 15 and 16 are merely one example of the synonymous compound noun specifying process according to the present embodiment. The process is not limited to that shown in FIGS. 15 and 16, and its content may be changed as appropriate without departing from the gist of the present embodiment.

[Third Embodiment]
FIG. 18 is a diagram showing the system configuration of the synonym dictionary creating system according to the third embodiment.

As shown in FIG. 18, the synonym dictionary creating system 5 according to the present embodiment includes the synonym expression extraction device 1, a first storage device 6, and a second storage device 7. The synonym expression extraction device 1 is connected to each of the first storage device 6 and the second storage device 7 by a transmission cable or the like, so that data can be exchanged between the devices. The synonym expression extraction device 1, the first storage device 6, and the second storage device 7 are also connected to a network 8 such as the Internet or a Local Area Network (LAN), and are communicably connected to terminal devices 9 (9A to 9C) via the network 8.

The synonym expression extraction device 1 according to the present embodiment extracts synonymous expressions for the compound nouns contained in a document set, as described in the first or second embodiment. The synonym expression extraction device 1 according to the present embodiment also includes a communication unit that communicates (sends and receives data) with each of the first storage device 6 and the second storage device 7.

The first storage device 6 stores the document sets from which synonymous expressions of compound nouns are extracted. For example, as shown in FIG. 18, the first storage device 6 stores a plurality of document sets, including a first-field document set 601 and a second-field document set 602, each of which accumulates only document data of a given field. The document sets 601 and 602 accumulate, for example, document data that a user of the synonym dictionary creating system 5 has collected using a terminal device 9 and transferred from the terminal device 9 to the first storage device 6.

The second storage device 7 stores synonym dictionaries containing pairs of compound nouns that are synonymous expressions. For example, as shown in FIG. 18, the second storage device 7 stores a plurality of synonym dictionaries, including a first-field synonym dictionary 701 and a second-field synonym dictionary 702. The first-field synonym dictionary 701 registers the synonymous expressions of compound nouns extracted from the first-field document set 601. The second-field synonym dictionary 702 registers the synonymous expressions of compound nouns extracted from the second-field document set 602. The synonym dictionaries 701 and 702 register, for example, the pairs of synonymous compound nouns that the synonym expression extraction device 1 has extracted from the document sets 601 and 602. A user of the synonym dictionary creating system 5 can use a terminal device 9 to access the synonym dictionaries 701 and 702 in the second storage device 7 and look up other expressions (synonymous expressions) for a compound noun in a document. The synonym dictionaries stored in the second storage device 7 may contain synonyms for various words as well as synonyms for compound nouns.

In the synonym dictionary creating system 5 according to the present embodiment, for example, a user of the system collects various document data using a terminal device 9 and accumulates it in the document sets 601 and 602 as needed. The synonym expression extraction device 1 in the synonym dictionary creating system 5 performs processing such as that shown in FIG. 4 when, for example, it receives a command that a user has sent from a terminal device 9, or when a preset date and time arrives. In this processing, the synonym expression extraction device 1 acquires the document set of one field from the first storage device 6 and extracts pairs of synonymous compound nouns. It then registers the extracted pairs of compound nouns in the corresponding synonym dictionary in the second storage device 7. When registering the extracted pairs, the synonym expression extraction device 1 refers to the pairs of compound nouns already registered in the synonym dictionary and adds only the pairs that are not yet registered.
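The registration rule described above, adding only pairs that are not yet in the dictionary, can be illustrated with a short sketch. The in-memory set, the treatment of a pair as unordered, and the function name are assumptions made for this example; the specification does not state how the synonym dictionary is stored.

```python
def register_pairs(dictionary, extracted_pairs):
    """Add unregistered compound noun pairs to the dictionary and return those added.

    dictionary is a set of frozensets so that (a, b) and (b, a) count as
    the same pair; this representation is an assumption of the sketch.
    """
    added = []
    for first, second in extracted_pairs:
        key = frozenset((first, second))
        if key not in dictionary:
            dictionary.add(key)
            added.append((first, second))
    return added


# Hypothetical usage: the second call skips the pair that is already registered.
synonym_dictionary = set()
first_batch = register_pairs(synonym_dictionary, [("storage unit", "memory device")])
second_batch = register_pairs(
    synonym_dictionary,
    [("memory device", "storage unit"), ("display unit", "screen device")],
)
```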

As described above, the synonym dictionary creating system 5 according to the present embodiment extracts the synonymous expressions of compound nouns used in a specific field from a document set containing only documents of that field, and registers (accumulates) them in the synonym dictionary prepared for that field. Furthermore, as described in the first and second embodiments, the synonym dictionary creating system 5 divides a pair of compound nouns into a plurality of word pairs and determines that the pair of compound nouns is a synonymous expression when all of those word pairs are synonymous word pairs. The synonym dictionary creating system 5 according to the present embodiment can therefore efficiently and accurately extract synonymous expressions of compound nouns used only in a specific field and create a synonym dictionary from them.
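The rule that a pair of compound nouns is synonymous only when every one of its word pairs is synonymous can be sketched as follows. How a compound noun pair is divided into word pairs is described in the first embodiment, so this sketch takes the word pairs as given; the data structures and names are hypothetical.

```python
def is_synonymous_compound_pair(word_pairs, attributes):
    """Return True only if every word pair of the compound noun pair is synonymous.

    word_pairs: list of (word_a, word_b) tuples obtained from one compound noun pair.
    attributes: dict mapping a word pair to "synonymous" or "undetermined".
    An empty list of word pairs is treated as not synonymous (an assumption).
    """
    return bool(word_pairs) and all(
        attributes.get(pair) == "synonymous" for pair in word_pairs
    )


# Hypothetical attributes after the synonym determination process.
attributes = {
    ("storage", "memory"): "synonymous",
    ("unit", "device"): "synonymous",
    ("display", "memory"): "undetermined",
}
accepted = is_synonymous_compound_pair(
    [("storage", "memory"), ("unit", "device")], attributes
)
rejected = is_synonymous_compound_pair(
    [("display", "memory"), ("unit", "device")], attributes
)
```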

The system configuration of FIG. 18 is merely one example of the system configuration of the synonym dictionary creating system 5 according to the present embodiment, and may be changed as appropriate. For example, the first storage device 6 may store only the document set of one specific field, and the second storage device 7 may store only the synonym dictionary of that field. Alternatively, a single synonym dictionary may be prepared in the second storage device 7, and the synonymous expressions of compound nouns extracted from each of a plurality of document sets may be registered together in that one dictionary. Furthermore, instead of the first storage device 6 and the second storage device 7, the synonym dictionary creating system 5 may include a single storage device that stores both the document sets and the synonym dictionaries.

[Fourth Embodiment]
FIG. 19 is a diagram showing the system configuration of the document rewriting system according to the fourth embodiment.

As shown in FIG. 19, the document rewriting system 10 according to the present embodiment includes the synonym expression extraction device 1, the first storage device 6, the second storage device 7, and a document data rewriting device 11. The synonym expression extraction device 1 is connected to each of the first storage device 6 and the second storage device 7 by a transmission cable or the like, so that data can be exchanged between the devices. The document data rewriting device 11 is likewise connected to the second storage device 7 by a transmission cable or the like, so that data can be exchanged between them. Furthermore, the synonym expression extraction device 1, the first storage device 6, the second storage device 7, and the document data rewriting device 11 are connected to a network 8 such as the Internet or a Local Area Network (LAN), and are communicably connected to terminal devices 9 (9A to 9C) via the network 8.

The first storage device 6 stores the document sets from which synonymous expressions of compound nouns are extracted. For example, as shown in FIG. 18, the first storage device 6 stores a plurality of document sets, including a first-field document set 601 and a second-field document set 602, each of which accumulates only document data of a given field. The document sets 601 and 602 accumulate, for example, document data that a user of the document rewriting system 10 has collected using a terminal device 9 and transferred from the terminal device 9 to the first storage device 6.

The second storage device 7 stores synonym dictionaries containing pairs of compound nouns that are synonymous expressions. For example, as shown in FIG. 18, the second storage device 7 stores a plurality of synonym dictionaries, including a first-field synonym dictionary 701 and a second-field synonym dictionary 702. The first-field synonym dictionary 701 registers the synonymous expressions of compound nouns extracted from the first-field document set 601. The second-field synonym dictionary 702 registers the synonymous expressions of compound nouns extracted from the second-field document set 602. The synonym dictionaries 701 and 702 register, for example, the pairs of synonymous compound nouns that the synonym expression extraction device 1 has extracted from the document sets 601 and 602. A user of the document rewriting system 10 can use a terminal device 9 to access the synonym dictionaries 701 and 702 in the second storage device 7 and look up other expressions (synonymous expressions) for a compound noun in a document. The synonym dictionaries stored in the second storage device 7 may contain synonyms for various words as well as synonyms for compound nouns.

The document data rewriting device 11 refers to the synonym dictionaries 701 and 702 stored in the second storage device 7 and rewrites compound nouns contained in document data into synonymous expressions. For example, the document data rewriting device 11 extracts compound nouns from document data that a user of the document rewriting system 10 has sent to it from a terminal device 9. The document data rewriting device 11 then searches the synonym dictionary in the second storage device 7 and acquires other expressions (synonymous expressions) for each compound noun extracted from the document data. When a synonymous expression exists for an extracted compound noun, the document data rewriting device 11 determines whether to rewrite the compound noun into that expression. For example, it makes this determination based on the priorities of the compound noun extracted from the document data and of the synonymous expression acquired from the synonym dictionary. The priority of an expression is set based on, for example, how frequently it appears in the document set. When the synonymous expression acquired from the synonym dictionary has the higher priority, the document data rewriting device 11 rewrites the compound noun in the document data into that synonymous expression. After completing the above processing, the document data rewriting device 11 returns the document data to the terminal device 9.

FIG. 20 is a diagram showing the functional configuration of the document data rewriting device.
As shown in FIG. 20, the document data rewriting device 11 includes a document data acquisition unit 1110, a character string extraction unit 1120, a morphological analysis unit 1130, a compound noun extraction unit 1140, a synonym expression search unit 1150, a synonym expression rewriting unit 1160, and a document data reply unit 1170.

The document data acquisition unit 1110 accepts the document data sent from a terminal device 9 to the document data rewriting device 11. The document data reply unit 1170 returns the document data rewritten by the document data rewriting device 11 to the terminal device 9.

The character string extraction unit 1120 and the morphological analysis unit 1130 have the same functions as the character string extraction unit 110 and the morphological analysis unit 120, respectively, of the synonym expression extraction device 1 according to the first embodiment. The character string extraction unit 1120 extracts character strings from the document data to be rewritten. The morphological analysis unit 1130 refers to an analysis dictionary 1190 and performs morphological analysis on the extracted character strings.

The compound noun extraction unit 1140 extracts the compound nouns contained in the character strings of the document data, based on the result of the morphological analysis.

The synonym expression search unit 1150 searches the synonym dictionary in the second storage device 7 and acquires synonymous expressions for the compound nouns extracted from the character strings.

The synonym expression rewriting unit 1160 rewrites a compound noun contained in the character strings of the document data into the synonymous expression acquired from the synonym dictionary. It does so only when the synonymous expression acquired from the synonym dictionary has a higher priority than the compound noun contained in the document data.

When an instruction to rewrite compound nouns in document data is input from a terminal device 9 or the like, the document data rewriting device 11 in the document rewriting system 10 according to the present embodiment acquires the document data designated by the instruction and performs the processing shown in FIG. 21.

FIG. 21 is a flowchart illustrating the processing performed by the document data rewriting device.
Having acquired the document data to be rewritten, the document data rewriting device 11 first extracts character strings from the document data (step S11). Step S11 is performed by the character string extraction unit 1120.

Next, the document data rewriting device 11 performs morphological analysis on the extracted character strings (step S12). Step S12 is performed by the morphological analysis unit 1130, which refers to the analysis dictionary 1190 and divides each character string into morphemes according to a known analysis method.

Next, the document data rewriting device 11 extracts compound nouns from the character strings based on the result of the morphological analysis (step S13). Step S13 is performed by the compound noun extraction unit 1140, which extracts compound nouns from the character strings according to compound noun extraction conditions. For example, the extraction conditions include the condition that a word string consisting of two or more consecutive noun words (morphemes) is extracted as a compound noun.
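The example extraction condition in step S13, a run of consecutive nouns forms a compound noun, can be sketched as below. The token format (surface form plus a part-of-speech label) is an assumption standing in for the output of the morphological analysis; real analyzers use their own tag sets, and the full extraction conditions may include more than this one rule.

```python
def extract_compound_nouns(tokens):
    """Return the compound nouns found in a list of (surface, part_of_speech) tokens.

    A compound noun is taken to be a run of two or more consecutive tokens
    tagged "noun"; surfaces are joined without spaces, as in Japanese text.
    """
    compounds = []
    run = []
    for surface, pos in tokens:
        if pos == "noun":
            run.append(surface)
            continue
        if len(run) >= 2:
            compounds.append("".join(run))
        run = []
    # Flush a run that reaches the end of the token list.
    if len(run) >= 2:
        compounds.append("".join(run))
    return compounds


# Hypothetical morphological analysis result for "同義表現を抽出する".
tokens = [
    ("同義", "noun"),
    ("表現", "noun"),
    ("を", "particle"),
    ("抽出", "noun"),
    ("する", "verb"),
]
found = extract_compound_nouns(tokens)
```

A single noun such as "抽出" is not reported, because the condition requires at least two consecutive nouns.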

Next, the document data rewriting device 11 performs a loop process (steps S14 to S19) that searches the synonym dictionary and rewrites the compound nouns extracted from the document data into synonymous expressions. The loop process is performed by the synonym expression search unit 1150 and the synonym expression rewriting unit 1160. One iteration of the loop (steps S15 to S18) searches for a synonymous expression for one type of compound noun extracted from the document data and rewrites it. At the loop ends (steps S14 and S19), the synonym expression search unit 1150 selects one of the compound nouns extracted from the document data and determines whether synonymous expressions have been searched for all of the extracted compound nouns. When it has searched for synonymous expressions for every extracted compound noun, rewriting where necessary, the synonym expression search unit 1150 ends the loop process.

In one iteration of the loop process (steps S14 to S19), the synonym expression search unit 1150 first searches the synonym dictionary in the second storage device 7, using the compound noun selected for processing as the keyword (step S15).

Next, the synonym expression search unit 1150 determines whether another synonymous expression for the compound noun being processed is registered in the synonym dictionary (step S16). If no other synonymous expression is registered in the synonym dictionary (step S16; NO), the synonym expression search unit 1150 determines, at the loop end (step S19), whether to end the loop process.

If another synonymous expression for the compound noun being processed is registered in the synonym dictionary (step S16; YES), the synonym expression search unit 1150 next determines whether the other synonymous expression has a higher priority than the compound noun extracted from the document data (step S17). If the compound noun extracted from the document data is the higher-priority expression (step S17; NO), the synonym expression search unit 1150 determines, at the loop end (step S19), whether to end the loop process.

On the other hand, if the other synonymous expression is the higher-priority expression (step S17; YES), the synonym expression search unit 1150 causes the synonym expression rewriting unit 1160 to rewrite the compound noun in the document data into the other synonymous expression (step S18). Thereafter, the synonym expression search unit 1150 determines, at the loop end (step S19), whether to end the loop process.

When it is determined at the loop end (step S19) that the loop process is to end, the synonym expression search unit 1150 ends the loop process. After the loop process ends, the synonym expression search unit 1150, for example, causes the synonym expression rewriting unit 1160 to return the document data, in which compound nouns have been rewritten into other synonymous expressions, to the terminal device 9 (step S20).
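The loop of steps S14 to S19 can be sketched as follows. This is only an illustration under stated assumptions: the layout of the synonym dictionary (a list of groups, each mapping an expression to a numeric priority where a larger value means a preferred expression) and the name `rewrite_document` are hypothetical, and plain string replacement stands in for the rewriting performed by the synonym expression rewriting unit 1160.

```python
from typing import Dict, List

# Hypothetical dictionary layout: one group per set of synonymous expressions,
# mapping each expression to a priority (larger value = preferred expression).
SynonymGroup = Dict[str, int]


def rewrite_document(
    text: str,
    compound_nouns: List[str],
    synonym_dictionary: List[SynonymGroup],
) -> str:
    """Rewrite each compound noun to a higher-priority synonymous expression."""
    # S14/S19: visit each distinct compound noun once, in order of appearance.
    for noun in dict.fromkeys(compound_nouns):
        # S15: search the synonym dictionary with the compound noun as the keyword.
        group = next((g for g in synonym_dictionary if noun in g), None)
        if group is None:
            continue
        # S16: is any other synonymous expression registered?
        others = {expr: prio for expr, prio in group.items() if expr != noun}
        if not others:
            continue
        # S17: does another expression have a higher priority than the noun itself?
        best = max(others, key=others.get)
        if others[best] <= group[noun]:
            continue
        # S18: rewrite the compound noun in the document data.
        text = text.replace(noun, best)
    # S20: the rewritten document data is what would be returned to the terminal.
    return text


if __name__ == "__main__":
    dictionary = [{"PC": 1, "personal computer": 2}]
    print(rewrite_document("Restart the PC now.", ["PC"], dictionary))
```

Because the sketch uses simple substring replacement, a rewritten expression that happens to contain another compound noun could be rewritten again in a later iteration; a fuller implementation would presumably track positions in the document data instead.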

Once the document data has been returned to the terminal device 9, the document data rewriting device 11 ends the rewriting process for that piece of document data.

As described above, the document rewriting system 10 according to the present embodiment refers to a synonym dictionary created from the document set of each field by the method described in the first or second embodiment, and rewrites compound nouns in the document data into other synonymous expressions. Furthermore, as described above, the document rewriting system 10 rewrites a compound noun in the document data into another synonymous expression when that other synonymous expression has a higher priority than the compound noun extracted from the document data. Therefore, according to the present embodiment, compound nouns in the document data that are synonymous expressions can be unified into a single notation.

Note that the flowchart of FIG. 21 is merely an example of the processing performed by the document data rewriting device 11 according to the present embodiment. The processing performed by the document data rewriting device 11 is not limited to that shown in FIG. 21, and its content can be changed as appropriate without departing from the gist of the present embodiment.

Likewise, the system configuration of FIG. 19 is merely an example of the system configuration of the document rewriting system 10 according to the present embodiment. The document rewriting system 10 is not limited to the configuration shown in FIG. 19; for example, the synonym expression extraction device 1 and the document data rewriting device 11 may be integrated. Furthermore, the document set and the synonym dictionary may be stored in a single storage device.

In addition, the synonym expression extraction device 1 according to each of the above embodiments can be realized by a computer and a program executed by the computer. The synonym expression extraction device 1 realized by a computer and a program is described below with reference to FIG. 22.

FIG. 22 is a diagram showing a hardware configuration of a computer.
As shown in FIG. 22, the computer 15 includes a processor 1501, a main storage device 1502, an auxiliary storage device 1503, an input device 1504, an output device 1505, an input/output interface 1506, a communication control device 1507, and a medium driving device 1508. These elements 1501 to 1508 of the computer 15 are connected to one another by a bus 1510, so that data can be exchanged between the elements.

The processor 1501 is a Central Processing Unit (CPU), a Micro Processing Unit (MPU), or the like. The processor 1501 controls the overall operation of the computer 15 by executing various programs, including an operating system. The processor 1501 also executes, for example, a synonym expression extraction program that includes the process, shown in FIGS. 4 to 8, of extracting pairs of compound nouns that are synonymous expressions.

The main storage device 1502 includes a Read Only Memory (ROM) and a Random Access Memory (RAM), neither of which is shown. The ROM of the main storage device 1502 stores in advance, for example, a predetermined basic control program that the processor 1501 reads when the computer 15 starts up. The RAM of the main storage device 1502, on the other hand, is used by the processor 1501 as a working storage area as needed when executing various programs. The RAM of the main storage device 1502 can be used to store, for example, character strings extracted from the document set 191, the analysis result corpus 193, lists of undetermined word pairs and synonymous word pairs, and the similarity transition table 195. The RAM of the main storage device 1502 can also be used to store, for example, the analysis dictionary 192, the synonym word dictionary 194, and the synonym expression list 196.

The auxiliary storage device 1503 is a storage device with a larger capacity than the RAM of the main storage device 1502, for example a Hard Disk Drive (HDD) or a non-volatile memory such as flash memory (including a Solid State Drive (SSD)). The auxiliary storage device 1503 can be used to store various programs executed by the processor 1501, various data, and the like. The auxiliary storage device 1503 can be used to store, for example, a synonym expression extraction program that includes the process, shown in FIGS. 4 to 8, of extracting pairs of compound nouns that are synonymous expressions. The auxiliary storage device 1503 can also be used to store, for example, character strings extracted from the document set 191, the analysis result corpus 193, lists of undetermined word pairs and synonymous word pairs, and the similarity transition table 195. Furthermore, the auxiliary storage device 1503 can be used to store, for example, the analysis dictionary 192, the synonym word dictionary 194, and the synonym expression list 196.

The input device 1504 is, for example, a keyboard device or a touch panel device. When the operator (user) of the computer 15 performs a predetermined operation on the input device 1504, the input device 1504 transmits the input information associated with that operation to the processor 1501. The input device 1504 can be used, for example, to input a command that starts the process of extracting synonymous expressions for compound nouns, commands relating to other processes that the computer 15 can execute, and various setting values.

The output device 1505 is, for example, a display device such as a liquid crystal display, or a printing device such as a printer. The output device 1505 can be used to output the result of the process of extracting synonymous expressions for compound nouns and the content of the synonym expression list.

The input/output interface 1506 connects the computer 15 to other electronic devices. The input/output interface 1506 includes, for example, a connector conforming to the Universal Serial Bus (USB) standard. The input/output interface 1506 can be used, for example, to connect the computer 15 to the storage devices 6 and 7.

The communication control device 1507 connects the computer 15 to a network such as the Internet and controls various kinds of communication between the computer 15 and other communication devices over the network. The communication control device 1507 can be used, for example, for communication between the computer 15 and the terminal device 9.

The medium driving device 1508 reads programs and data recorded on the portable recording medium 16 and writes data and the like stored in the auxiliary storage device 1503 to the portable recording medium 16. A memory card reader/writer compatible with one or more standards, for example, can be used as the medium driving device 1508. When a memory card reader/writer is used as the medium driving device 1508, a memory card (flash memory) conforming to a standard supported by the reader/writer, for example the Secure Digital (SD) standard, can be used as the portable recording medium 16. A flash memory with a USB-standard connector, for example, can also be used as the portable recording medium 16. Furthermore, when the computer 15 is equipped with an optical disc drive that can be used as the medium driving device 1508, various optical discs recognizable by that drive can be used as the portable recording medium 16. Optical discs that can be used as the portable recording medium 16 include, for example, Compact Disc (CD), Digital Versatile Disc (DVD), and Blu-ray Disc (Blu-ray is a registered trademark). The portable recording medium 16 can be used to store, for example, a synonym expression extraction program that includes the process, shown in FIGS. 4 to 8, of extracting pairs of compound nouns that are synonymous expressions. The portable recording medium 16 can also be used to store, for example, character strings extracted from the document set 191, the analysis result corpus 193, lists of undetermined word pairs and synonymous word pairs, and the similarity transition table 195. Furthermore, the portable recording medium 16 can be used to store, for example, the analysis dictionary 192, the synonym word dictionary 194, and the synonym expression list 196.

For example, when the operator uses the input device 1504 or the like to input to the computer 15 a command that starts the synonymous expression extraction process, the processor 1501 reads the synonym expression extraction program stored on a non-transitory recording medium such as the auxiliary storage device 1503 and executes it. In this process, the processor 1501 functions (operates) as the character string extraction unit 110, the morphological analysis unit 120, the compound noun extraction unit 130, and the synonymous compound noun identification unit 140 of the synonym expression extraction device 1. While the processor 1501 is executing the synonym expression extraction program, the RAM of the main storage device 1502, the auxiliary storage device 1503, and the like function as a storage unit (not shown) of the synonym expression extraction device 1. That is, the RAM of the main storage device 1502, the auxiliary storage device 1503, and the like function as a storage unit that stores the document set 191, the analysis dictionary 192, the analysis result corpus 193, the synonym word dictionary 194, the similarity transition table 195, the synonym expression list 196, and the like.

Note that the computer 15 operated as the synonym expression extraction device 1 need not include all of the elements 1501 to 1508 shown in FIG. 22; some elements may be omitted depending on the application and conditions. For example, the communication control device 1507 and the medium driving device 1508 may be omitted from the computer 15.

In addition, a storage device of the computer 15, such as the auxiliary storage device 1503, can also be used as, for example, the first storage device 6 and the second storage device 7 of the synonym dictionary creation system shown in the third embodiment.

Furthermore, the computer 15 can be operated not only as the synonym expression extraction device 1 but also as the document data rewriting device 11 of the document rewriting system 10 shown in the fourth embodiment. In the document rewriting system 10, a single computer 15 can also be operated as both the synonym expression extraction device 1 and the document data rewriting device 11.

以上記載した各実施形態に関し、更に以下の付記を開示する。
（付記１）
文書データから抽出した複合名詞のペアを複数の単語ペアに分割し、同義である単語ペアが登録された同義単語辞書を参照して、前記複数の単語ペアを同義単語ペアと、同義であるか否かが確定していない未確定単語ペアと同定にする、単語ペア設定部と、
前記未確定単語ペアと、前記複合名詞のペアにおける前記同義単語ペアを含む、前記文書データ内の複数の同義単語ペアとのそれぞれに対し、単語間の意味類似度を学習する処理を複数回行う意味類似度学習部と、
前記未確定単語ペアの意味類似度の学習結果と、前記複数の同義単語ペアのそれぞれにおける意味類似度の学習結果とに基づいて、前記未確定単語ペアの単語同士が同義であるか否かを判定する単語同義判定部と、
前記複合名詞のペアにおける前記複数の単語ペアが全て前記同義単語ペアである場合に、該複合名詞のペアを同義表現であると判定する複合名詞同義判定部と、を備え、
前記意味類似度学習部は、処理対象である複数の単語ペアのそれぞれに対する意味類似度の学習処理を行う毎に、当該学習処理に用いる事例を追加し、
前記単語同義判定部は、前記未確定単語ペアの意味類似度の学習結果における学習処理の回数と意味類似度との関係と、前記同義単語ペアの意味類似度の学習結果における学習処理の回数と意味類似度との関係とについての相関係数を算出し、当該相関係数に基づいて、前記未確定単語ペアの単語同士が同義であるか否かを判定する、
ことを特徴とする同義表現抽出装置。
（付記２）
前記単語同義判定部は、
前記複数の同義単語ペアのそれぞれで、前記未確定単語ペアについての前記学習処理の回数と前記意味類似度との関係と、前記同義単語ペアについての前記学習処理の回数と前記意味類似度との関係とについての相関係数を算出し、
算出した複数の前記相関係数の平均値が閾値以上である場合に、前記未確定単語ペアの単語同士が同義であると判定する、
ことを特徴とする付記１に記載の同義表現抽出装置。
（付記３）
前記同義表現抽出装置は、前記複数の同義単語ペアのそれぞれにおける意味類似度の学習結果における学習処理の回数と意味類似度との関係に基づいて、前記未確定単語ペアの単語同士が同義であるか否かの判定閾値を設定する判定閾値設定部、を更に備え、
前記単語同義判定部は、算出した前記相関係数の平均値が、前記判定閾値設定部で設定した前記判定閾値以上である場合に、前記未確定単語ペアの単語同士が同義であると判定する、
ことを特徴とする付記２に記載の同義表現抽出装置。
（付記４）
前記同義表現抽出装置は、同義表現である前記複合名詞のペアを登録した同義表現リストを記憶する記憶部、を更に備える、
ことを特徴とする付記１に記載の同義表現抽出装置。
（付記５）
前記同義表現抽出装置は、前記同義表現リストに登録された前記複合名詞の同義表現に基づいて、文書データから抽出した複合名詞を同義表現に書き換える同義表現書換部、を更に備える、
ことを特徴とする付記１に記載の同義表現抽出装置。
（付記６）
コンピュータが、
文書データから抽出した複合名詞のペアを複数の単語ペアに分割し、同義である単語ペアが登録された同義単語辞書を参照して、前記複数の単語ペアを同義単語ペアと、同義であるか否かが確定していない未確定単語ペアと同定し、
前記文書データから、前記同義単語辞書に登録された前記同義単語ペアを収集し、
前記未確定単語ペアと、前記複合名詞のペアにおける前記同義単語ペアを含む、前記文書データ内の複数の同義単語ペアとのそれぞれに対し、単語間の意味類似度を学習する処理を複数回実行し、
前記未確定単語ペアの意味類似度の学習結果と、前記複数の同義単語ペアのそれぞれにおける意味類似度の学習結果とに基づいて、前記未確定単語ペアの単語同士が同義であるか否かを判定し、
前記複合名詞のペアにおける前記複数の単語ペアが全て前記同義単語ペアである場合に、該複合名詞のペアを同義表現であると判定する、処理を実行し、
前記意味類似度を学習する処理において、前記コンピュータは、処理対象である複数の単語ペアのそれぞれに対する意味類似度の学習処理を行う毎に、当該学習処理に用いる事例を追加し、
前記未確定単語ペアの単語同士が同義であるか否かを判定する処理において、前記コンピュータは、前記未確定単語ペアの意味類似度の学習結果における学習処理の回数と意味類似度との関係と、前記同義単語ペアの意味類似度の学習結果における学習処理の回数と意味類似度との関係とについての相関係数を算出し、当該相関係数に基づいて、前記未確定単語ペアの単語同士が同義であるか否かを判定する、
ことを特徴とする同義表現抽出方法。
（付記７）
前記未確定単語ペアの単語同士が同義であるか否かを判定する処理において、前記コンピュータは、
前記複数の同義単語ペアのそれぞれで、前記未確定単語ペアについての前記学習処理の回数と前記意味類似度との関係と、前記同義単語ペアについての前記学習処理の回数と前記意味類似度との関係とについての相関係数を算出し、
算出した複数の前記相関係数の平均値が閾値以上である場合に、前記未確定単語ペアの単語同士が同義であると判定する、
ことを特徴とする付記６に記載の同義表現抽出方法。
（付記８）
前記未確定単語ペアの単語同士が同義であるか否かを判定する処理を行う前に、前記コンピュータが、前記複数の同義単語ペアのそれぞれにおける意味類似度の学習結果における学習処理の回数と意味類似度との関係に基づいて、前記未確定単語ペアの単語同士が同義であるか否かの判定閾値を設定する処理を、更に含み、
前記未確定単語ペアの単語同士が同義であるか否かを判定する処理において、前記コンピュータは、前記相関係数の平均値が前記判定閾値以上である場合に、前記未確定単語ペアの単語同士が同義であると判定する、
ことを特徴とする付記７に記載の同義表現抽出方法。
（付記９）
文書データから抽出した複合名詞のペアを複数の単語ペアに分割し、同義である単語ペアが登録された同義単語辞書を参照して、前記複数の単語ペアを同義単語ペアと、同義であるか否かが確定していない未確定単語ペアと同定し、
前記文書データから、前記同義単語辞書に登録された前記同義単語ペアを収集し、
前記未確定単語ペアと、前記複合名詞のペアにおける前記同義単語ペアを含む、前記文書データ内の複数の同義単語ペアとのそれぞれに対し、単語間の意味類似度を学習する処理を複数回実行し、
前記未確定単語ペアの意味類似度の学習結果と、前記複数の同義単語ペアのそれぞれにおける意味類似度の学習結果とに基づいて、前記未確定単語ペアの単語同士が同義であるか否かを判定し、
前記複合名詞のペアにおける前記複数の単語ペアが全て前記同義単語ペアである場合に、該複合名詞のペアを同義表現であると判定する、処理をコンピュータに実行させる同義表現抽出プログラムであって、
前記意味類似度を学習する処理は、処理対象である複数の単語ペアのそれぞれに対する意味類似度の学習処理を行う毎に、当該学習処理に用いる事例を追加する処理を含み、
前記未確定単語ペアの単語同士が同義であるか否かを判定する処理は、前記未確定単語ペアの意味類似度の学習結果における学習処理の回数と意味類似度との関係と、前記同義単語ペアの意味類似度の学習結果における学習処理の回数と意味類似度との関係とについての相関係数を算出し、当該相関係数に基づいて、前記未確定単語ペアの単語同士が同義であるか否かを判定する処理を含む、
ことを特徴とする同義表現抽出プログラム。
（付記１０）
前記未確定単語ペアの単語同士が同義であるか否かを判定する処理は、
前記複数の同義単語ペアのそれぞれで、前記未確定単語ペアについての前記学習処理の回数と前記意味類似度との関係と、前記同義単語ペアについての前記学習処理の回数と前記意味類似度との関係とについての相関係数を算出し、
算出した複数の前記相関係数の平均値が閾値以上である場合に、前記未確定単語ペアの単語同士が同義であると判定する、処理を含む、
ことを特徴とする付記９に記載の同義表現抽出プログラム。
（付記１１）
前記未確定単語ペアの単語同士が同義であるか否かを判定する処理の前に実行する、前記複数の同義単語ペアのそれぞれにおける意味類似度の学習結果における学習処理の回数と意味類似度との関係に基づいて、前記未確定単語ペアの単語同士が同義であるか否かの判定閾値を設定する処理を、更に含み、
前記未確定単語ペアの単語同士が同義であるか否かを判定する処理は、前記相関係数の平均値が前記判定閾値以上である場合に、前記未確定単語ペアの単語同士が同義であると判定する、
ことを特徴とする付記１０に記載の同義表現抽出プログラム。 The following supplementary notes will be further disclosed regarding each of the embodiments described above.
(Appendix 1)
Divide the compound noun pair extracted from the document data into a plurality of word pairs, and refer to the synonymous word dictionary in which synonymous word pairs are registered, and refer to the synonymous word pairs as synonymous word pairs. A word pair setting unit for identifying an undetermined word pair whose presence or absence has not been confirmed,
For each of the plurality of synonymous word pairs in the document data, including the synonymous word pair in the compound noun pair and the undetermined word pair, a process of learning the semantic similarity between words is performed a plurality of times. A semantic similarity learning unit,
Based on the learning result of the semantic similarity of the undetermined word pair and the learning result of the semantic similarity in each of the plurality of synonymous word pairs, whether the words of the undetermined word pair are synonymous or not. A word synonym determination unit for determination,
When the plurality of word pairs in the pair of compound nouns are all synonymous word pairs, a compound noun synonym determination unit that determines that the pair of compound nouns are synonymous expressions, and
The semantic similarity learning unit adds a case to be used for the learning process every time the semantic similarity learning process is performed on each of a plurality of word pairs to be processed,
The word synonym determination unit, the relationship between the number of learning processing and the semantic similarity in the learning result of the semantic similarity of the undetermined word pair, and the number of learning processing in the learning result of the semantic similarity of the synonymous word pair and Calculate a correlation coefficient for the relationship with the semantic similarity, based on the correlation coefficient, to determine whether the words of the undetermined word pair are synonymous,
A synonymous expression extraction device characterized by the above.
(Appendix 2)
The word synonym determination unit,
In each of the plurality of synonymous word pairs, the relationship between the number of learning processes for the undetermined word pair and the semantic similarity, the number of learning processes for the synonymous word pair and the semantic similarity Calculate the correlation coefficient for and
When the average value of the plurality of calculated correlation coefficients is equal to or greater than a threshold value, it is determined that the words of the undetermined word pair are synonymous with each other,
The synonymous expression extracting device according to appendix 1, characterized in that.
(Appendix 3)
The synonymous expression extraction device, based on the relationship between the number of learning processing and the semantic similarity in the learning result of the semantic similarity in each of the plurality of synonymous word pairs, the words of the undetermined word pair are synonymous Further comprising a determination threshold setting unit for setting a determination threshold of whether or not,
The word synonym determination unit determines that the words of the undetermined word pair are synonymous when the average value of the calculated correlation coefficient is equal to or more than the determination threshold set by the determination threshold setting unit. ,
The synonymous expression extraction device as described in appendix 2, wherein.
(Appendix 4)
The synonym expression extraction device further includes a storage unit that stores a synonym expression list in which the pair of compound nouns that are synonymous expressions are registered.
The synonymous expression extracting device according to appendix 1, characterized in that.
(Appendix 5)
The synonym expression extraction device further includes a synonym expression rewriting unit that rewrites a compound noun extracted from document data into a synonym expression, based on the synonym expression of the compound noun registered in the synonym expression list,
The synonymous expression extracting device according to appendix 1, characterized in that.
(Appendix 6)
Computer
Divide the compound noun pair extracted from the document data into a plurality of word pairs, and refer to the synonymous word dictionary in which synonymous word pairs are registered, and refer to the synonymous word pairs as synonymous word pairs. Identified as an undetermined word pair with undetermined whether
From the document data, collect the synonym word pairs registered in the synonym word dictionary,
The process of learning the semantic similarity between words is executed a plurality of times for each of the plurality of synonymous word pairs in the document data including the synonymous word pair in the compound noun pair and the undetermined word pair. Then
Based on the learning result of the semantic similarity of the undetermined word pair and the learning result of the semantic similarity in each of the plurality of synonymous word pairs, whether the words of the undetermined word pair are synonymous or not. Judge,
When the plurality of word pairs in the pair of compound nouns are all synonymous word pairs, it is determined that the pair of compound nouns are synonymous expressions, a process is executed,
In the process of learning the semantic similarity, the computer adds a case to be used for the learning process every time the learning process of the semantic similarity is performed for each of a plurality of word pairs to be processed,
In the process of determining whether the words of the undetermined word pair are synonymous with each other, the computer has a relationship between the number of learning processes in the learning result of the semantic similarity of the undetermined word pair and the semantic similarity. , Calculating a correlation coefficient for the relationship between the number of learning processes and the semantic similarity in the learning result of the semantic similarity of the synonymous word pair, based on the correlation coefficient, the words of the undetermined word pair Determines whether is synonymous,
A synonym expression extraction method characterized by the above.
(Appendix 7)
In the process of determining whether the words of the undetermined word pair are synonymous, the computer,
In each of the plurality of synonymous word pairs, the relationship between the number of learning processes for the undetermined word pair and the semantic similarity, the number of learning processes for the synonymous word pair and the semantic similarity Calculate the correlation coefficient for and
When the average value of the plurality of calculated correlation coefficients is equal to or greater than a threshold value, it is determined that the words of the undetermined word pair are synonymous with each other,
The synonymous expression extraction method according to appendix 6, characterized in that
(Appendix 8)
Before performing the process of determining whether the words of the undetermined word pair are synonymous with each other, the computer, the number and the meaning of the learning process in the learning result of the semantic similarity in each of the plurality of synonymous word pairs Based on the relationship with the degree of similarity, further comprising a process of setting a determination threshold whether or not the words of the undetermined word pair are synonymous,
In the process of determining whether the words of the undetermined word pair are synonymous with each other, the computer, when the average value of the correlation coefficient is greater than or equal to the determination threshold, the words of the undetermined word pair Is determined to be synonymous,
The synonymous expression extraction method according to appendix 7, characterized in that
(Appendix 9)
Divide the compound noun pair extracted from the document data into a plurality of word pairs, and refer to the synonymous word dictionary in which synonymous word pairs are registered, and refer to the synonymous word pairs as synonymous word pairs. Identified as an undetermined word pair with undetermined whether
From the document data, collect the synonym word pairs registered in the synonym word dictionary,
The process of learning the semantic similarity between words is executed a plurality of times for each of the plurality of synonymous word pairs in the document data including the synonymous word pair in the compound noun pair and the undetermined word pair. Then
Based on the learning result of the semantic similarity of the undetermined word pair and the learning result of the semantic similarity in each of the plurality of synonymous word pairs, whether the words of the undetermined word pair are synonymous or not. Judge,
A synonym expression extraction program that causes a computer to execute a process of determining that the pair of compound nouns are synonymous expressions when all of the plurality of word pairs in the pair of compound nouns are the synonymous word pairs,
The process of learning the semantic similarity includes a process of adding a case used for the learning process every time the learning process of the semantic similarity is performed on each of a plurality of word pairs to be processed,
The process of determining whether or not the words of the undetermined word pair are synonymous with each other, the relationship between the number of learning processes and the semantic similarity in the learning result of the semantic similarity of the undetermined word pair, and the synonymous word The correlation coefficient is calculated for the relationship between the number of learning processes and the semantic similarity in the learning result of the semantic similarity of the pair, and the words of the undetermined word pair are synonymous based on the correlation coefficient. Including the process of determining whether or not
A synonym expression extraction program characterized by the following.
(Appendix 10)
The process of determining whether the words of the undetermined word pair are synonymous includes a process of:
Calculating, for each of the plurality of synonymous word pairs, the correlation coefficient between the relationship between the number of learning processes and the semantic similarity for the undetermined word pair and the relationship between the number of learning processes and the semantic similarity for that synonymous word pair, and
Determining that the words of the undetermined word pair are synonymous when the average value of the plurality of calculated correlation coefficients is equal to or greater than a threshold value,
The synonym expression extraction program according to appendix 9.
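The determination in appendices 9 and 10 can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes each word pair's learning result is a list of semantic similarity values, one per learning round, and it assumes the Pearson correlation coefficient (the text only says "correlation coefficient"). The names `pearson` and `is_synonymous` and the handling of a flat series are hypothetical.

```python
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        # Assumption: a series that never changes is treated as uncorrelated.
        return 0.0
    return cov / sqrt(vx * vy)


def is_synonymous(undetermined, known_synonyms, threshold):
    """Judge an undetermined word pair from its similarity transition.

    undetermined:   similarity per learning round for the undetermined pair
    known_synonyms: one similarity-per-round list for each known synonymous pair
    threshold:      minimum average correlation required for a positive result
    """
    coeffs = [pearson(undetermined, series) for series in known_synonyms]
    return mean(coeffs) >= threshold
```

For example, `is_synonymous([0.1, 0.3, 0.5, 0.7], [[0.2, 0.4, 0.6, 0.8], [0.1, 0.2, 0.4, 0.5]], 0.8)` compares a rising transition against two rising reference transitions; the threshold value 0.8 is an arbitrary illustrative choice.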
(Appendix 11)
The program further causes the computer to execute, before the process of determining whether the words of the undetermined word pair are synonymous, a process of setting a determination threshold for whether the words of the undetermined word pair are synonymous, based on the relationship between the number of learning processes and the semantic similarity in the learning result of the semantic similarity of each of the plurality of synonymous word pairs,
In the process of determining whether the words of the undetermined word pair are synonymous, the words of the undetermined word pair are determined to be synonymous when the average value of the correlation coefficients is equal to or greater than the determination threshold,
The synonym expression extraction program according to appendix 10.

1 Synonym expression extraction device
2, 3 Compound nouns
401-403, 405, 406 Tables
404 Word pair list
5 Synonym dictionary creation system
6, 7 Storage devices
8 Network
9 (9A-9C) Terminal devices
10 Document rewriting system
11 Document data rewriting device
15 Computer
16 Portable recording medium
110, 1120 Character string extraction unit
120, 1130 Morphological analysis unit
130, 1140 Compound noun extraction unit
140 Synonymous compound noun identification unit
141 Word pair setting unit
142 Similarity transition table creation unit
143 Semantic similarity learning unit
144 Word synonym determination unit
145 Compound noun synonym determination unit
146 Determination threshold setting unit
191, 601, 602 Document set
192, 1190 Analysis dictionary
103 Analysis result corpus
194 Synonym word dictionary
195 Similarity transition table
196 Synonym expression list
701, 702 Synonym dictionary
1110 Document data acquisition unit
1150 Synonym expression search unit
1160 Synonym expression rewriting unit
1170 Document data reply unit
1501 Processor
1502 Main storage device
1503 Auxiliary storage device
1504 Input device
1505 Output device
1506 Input/output interface
1507 Communication control device
1508 Auxiliary storage device

Claims

A word pair setting unit that divides a pair of compound nouns extracted from document data into a plurality of word pairs and, by referring to a synonym word dictionary in which synonymous word pairs are registered, identifies an undetermined word pair, that is, a word pair for which it has not been determined whether its words are synonymous,
A semantic similarity learning unit that executes, a plurality of times, a process of learning the semantic similarity between words for the undetermined word pair and for each of a plurality of synonymous word pairs in the document data, including the synonymous word pair in the compound noun pair,
A word synonym determination unit that determines whether the words of the undetermined word pair are synonymous, based on the learning result of the semantic similarity of the undetermined word pair and the learning result of the semantic similarity of each of the plurality of synonymous word pairs, and
A compound noun synonym determination unit that determines that the pair of compound nouns are synonymous expressions when all of the plurality of word pairs in the pair of compound nouns are synonymous word pairs, wherein
The semantic similarity learning unit adds a case to be used for learning each time the learning process of the semantic similarity is performed on each of the plurality of word pairs to be processed, and
The word synonym determination unit calculates a correlation coefficient between the relationship between the number of learning processes and the semantic similarity in the learning result for the undetermined word pair and the relationship between the number of learning processes and the semantic similarity in the learning result for the synonymous word pair, and determines, based on the correlation coefficient, whether the words of the undetermined word pair are synonymous,
A synonym expression extraction device characterized by the above.
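The roles of the word pair setting unit and the compound noun synonym determination unit can be sketched as below. This is a minimal illustration under stated assumptions, not the claimed implementation: the compound nouns are assumed to be already split into words of equal count and paired by position, identical words are assumed to count as synonymous, and the dictionary is assumed to be a set of unordered pairs. All function names and the example words are hypothetical.

```python
def split_into_word_pairs(compound_a, compound_b):
    """Pair the words of two compound nouns by position (an assumed alignment)."""
    if len(compound_a) != len(compound_b):
        raise ValueError("this sketch only handles compounds with equal word counts")
    return list(zip(compound_a, compound_b))


def classify_word_pairs(word_pairs, synonym_dict):
    """Split word pairs into known synonymous pairs and undetermined pairs.

    synonym_dict is a set of frozensets, so lookup ignores word order.
    """
    known, undetermined = [], []
    for a, b in word_pairs:
        # Assumption: identical words are treated as trivially synonymous.
        if a == b or frozenset((a, b)) in synonym_dict:
            known.append((a, b))
        else:
            undetermined.append((a, b))
    return known, undetermined


def compound_nouns_synonymous(word_pairs, synonym_dict, judge_undetermined):
    """True only when every word pair is synonymous.

    judge_undetermined is a callable that decides each undetermined pair,
    standing in for the word synonym determination unit.
    """
    _, undetermined = classify_word_pairs(word_pairs, synonym_dict)
    return all(judge_undetermined(pair) for pair in undetermined)


# Usage with hypothetical English words in place of the Japanese examples.
synonym_dict = {frozenset(("customer", "client"))}
pairs = split_into_word_pairs(["customer", "support"], ["client", "assistance"])
known, undetermined = classify_word_pairs(pairs, synonym_dict)
```

Passing the judgment in as a callable keeps the all-pairs rule separate from how an individual undetermined pair is decided.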

The word synonym determination unit
Calculates, for each of the plurality of synonymous word pairs, the correlation coefficient between the relationship between the number of learning processes and the semantic similarity for the undetermined word pair and the relationship between the number of learning processes and the semantic similarity for that synonymous word pair, and
Determines that the words of the undetermined word pair are synonymous when the average value of the plurality of calculated correlation coefficients is equal to or greater than a threshold value,
The synonym expression extraction device according to claim 1.

The synonym expression extraction device further comprises a determination threshold setting unit that sets a determination threshold for whether the words of the undetermined word pair are synonymous, based on the relationship between the number of learning processes and the semantic similarity in the learning result of the semantic similarity of each of the plurality of synonymous word pairs,
The word synonym determination unit determines that the words of the undetermined word pair are synonymous when the average value of the calculated correlation coefficients is equal to or greater than the determination threshold set by the determination threshold setting unit,
The synonym expression extraction device according to claim 2.

The synonym expression extraction device further includes a storage unit that stores a synonym expression list in which pairs of compound nouns determined to be synonymous expressions are registered,
The synonym expression extraction device according to claim 1.

A computer
Divides a pair of compound nouns extracted from document data into a plurality of word pairs and, by referring to a synonym word dictionary in which synonymous word pairs are registered, identifies a word pair for which it has not been determined whether its words are synonymous as an undetermined word pair,
Collects, from the document data, the synonymous word pairs registered in the synonym word dictionary,
Executes, a plurality of times, a process of learning the semantic similarity between words for the undetermined word pair and for each of a plurality of synonymous word pairs in the document data, including the synonymous word pair in the compound noun pair,
Determines whether the words of the undetermined word pair are synonymous, based on the learning result of the semantic similarity of the undetermined word pair and the learning result of the semantic similarity of each of the plurality of synonymous word pairs, and
Determines that the pair of compound nouns are synonymous expressions when all of the plurality of word pairs in the pair of compound nouns are synonymous word pairs, wherein
In the process of learning the semantic similarity, the computer adds a case to be used for learning each time the learning process of the semantic similarity is performed on each of the plurality of word pairs to be processed, and
In the process of determining whether the words of the undetermined word pair are synonymous, the computer calculates a correlation coefficient between the relationship between the number of learning processes and the semantic similarity in the learning result for the undetermined word pair and the relationship between the number of learning processes and the semantic similarity in the learning result for the synonymous word pair, and determines, based on the correlation coefficient, whether the words of the undetermined word pair are synonymous,
A synonym expression extraction method characterized by the above.
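The repeated learning step, in which cases are added before each round, can be sketched as follows. This is only an illustration: the claim does not state the learning model, so sentence-level co-occurrence counts with cosine similarity are swapped in here as a simple stand-in. The function names, the `step` parameter, and the tokenized example sentences are assumptions.

```python
from collections import Counter
from math import sqrt


def context_vector(word, sentences):
    """Count the words that co-occur with `word` in the same sentence."""
    vec = Counter()
    for tokens in sentences:
        if word in tokens:
            vec.update(t for t in tokens if t != word)
    return vec


def cosine(u, v):
    """Cosine similarity of two sparse count vectors (0.0 if either is empty)."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    nu = sqrt(sum(c * c for c in u.values()))
    nv = sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def similarity_transition(pair, cases, rounds, step):
    """Similarity of `pair` after each round, with `step` more cases per round."""
    a, b = pair
    transition = []
    for r in range(1, rounds + 1):
        used = cases[: r * step]  # each round sees all earlier cases plus new ones
        transition.append(cosine(context_vector(a, used), context_vector(b, used)))
    return transition


# Hypothetical tokenized sentences standing in for cases from the document data.
cases = [
    ["customer", "calls", "desk"],
    ["client", "calls", "desk"],
    ["customer", "pays", "invoice"],
    ["client", "pays", "invoice"],
]
transition = similarity_transition(("customer", "client"), cases, rounds=4, step=1)
```

The resulting list pairs each round number with a similarity value, which is the kind of series the correlation-based determination compares across word pairs.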

A pair of compound nouns extracted from document data is divided into a plurality of word pairs and, by referring to a synonym word dictionary in which synonymous word pairs are registered, a word pair for which it has not been determined whether its words are synonymous is identified as an undetermined word pair,
The synonymous word pairs registered in the synonym word dictionary are collected from the document data,
The process of learning the semantic similarity between words is executed a plurality of times for the undetermined word pair and for each of a plurality of synonymous word pairs in the document data, including the synonymous word pair in the compound noun pair,
Whether the words of the undetermined word pair are synonymous is determined based on the learning result of the semantic similarity of the undetermined word pair and the learning result of the semantic similarity of each of the plurality of synonymous word pairs,
A synonym expression extraction program that causes a computer to execute a process of determining that the pair of compound nouns are synonymous expressions when all of the plurality of word pairs in the pair of compound nouns are synonymous word pairs, wherein
The process of learning the semantic similarity includes a process of adding a case to be used for learning each time the learning process of the semantic similarity is performed on each of the plurality of word pairs to be processed, and
The process of determining whether the words of the undetermined word pair are synonymous includes a process of calculating a correlation coefficient between the relationship between the number of learning processes and the semantic similarity in the learning result for the undetermined word pair and the relationship between the number of learning processes and the semantic similarity in the learning result for the synonymous word pair, and determining, based on the correlation coefficient, whether the words of the undetermined word pair are synonymous,
A synonym expression extraction program characterized by the above.