JP5117590B2 - Document processing apparatus and program

Info

Publication number: JP5117590B2
Application number: JP2011065006A
Authority: JP
Inventors: 佳美齋藤; 敏行加納; 早織新田
Original assignee: Toshiba Corp; Toshiba Solutions Corp
Current assignee: Toshiba Corp; Toshiba Digital Solutions Corp
Priority date: 2011-03-23
Filing date: 2011-03-23
Publication date: 2013-01-16
Anticipated expiration: 2031-03-23
Also published as: JP2012203472A

Description

Embodiments of the present invention relate to a document processing apparatus and a program for extracting synonyms from documents.

In document processing, the handling of synonyms, that is, other terms that express the same meaning as a given term (word), is generally an important issue.

For example, "消しゴム" (keshigomu, "eraser") and "字消し" (jikeshi, also "eraser") are considered to be synonyms that express the same meaning. However, unless information indicating that "消しゴム" and "字消し" are synonyms (hereinafter referred to as synonym information) is prepared in advance, it is not easy to search for or generate one from the other (for example, "字消し" from "消しゴム").

Specifically, in document search, it is difficult to retrieve a document containing "字消し" from the search keyword "消しゴム" without using synonym information indicating that the two are synonyms. Likewise, in document classification, if such synonym information is not prepared, a document containing "消しゴム" and a document containing "字消し" may be assigned to different classifications even though they should belong to the same one.

For such cases, it is known to execute document search, document classification, or the like by referring to a synonym dictionary in which terms having a synonym relationship (for example, "消しゴム" and "字消し") are registered in advance.

However, since the number of terms having a synonym relationship is enormous, manually preparing all such terms in advance requires an enormous amount of work and is difficult.

It has therefore been considered to extract synonyms automatically from a given document (set), for example by using the context similarity or the character string similarity of terms that appear in the documents. This makes it possible to register terms having a synonym relationship in the synonym dictionary without manually preparing all of them in advance.

The context similarity is a similarity calculated on the basis of the distributional hypothesis that "semantically similar words appear in similar contexts," and is calculated, for example, as the similarity of the terms with which a term has a dependency relationship or with which it co-occurs. The character string similarity is the similarity of the character strings that make up the terms themselves, and is calculated, for example, according to the number of characters that two terms have in common.

JP-A-5-346938; JP 2000-222427 A

However, the context similarity and the character string similarity described above are both based on local similarity within documents.

Consequently, when a set of terms having a synonym relationship (a synonym set) is extracted from documents using only the context similarity or the character string similarity, local information has a strong influence, and a term that is not appropriate as a synonym (a noise term) may be included in the synonym set. Specifically, such a synonym set may contain, in addition to the appropriate synonyms "消しゴム" (keshigomu, "eraser") and "字消し" (jikeshi, "eraser"), an inappropriate term such as "取り消し" (torikeshi, "cancellation").
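The noise problem can be illustrated with a minimal sketch of a character string similarity based on shared characters. The description only says that the similarity depends on the number of common characters; the Dice-style normalization and the function names `common_char_count` and `string_similarity` below are assumptions made for illustration, not the formula of the embodiment.

```python
from collections import Counter


def common_char_count(a: str, b: str) -> int:
    """Count the characters shared by a and b (multiset intersection)."""
    return sum((Counter(a) & Counter(b)).values())


def string_similarity(a: str, b: str) -> float:
    """Assumed Dice-style score: 2 * shared / (len(a) + len(b))."""
    total = len(a) + len(b)
    return 2 * common_char_count(a, b) / total if total else 0.0


# "字消し" shares the two characters "消" and "し" with both other terms,
# so the synonym and the noise term receive the same score.
synonym_score = string_similarity("字消し", "消しゴム")
noise_score = string_similarity("字消し", "取り消し")
```

Under this assumed score, "字消し" is exactly as close to "取り消し" as it is to "消しゴム", which is why a purely local string measure cannot separate the noise term from the true synonym.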

When document search, document classification, or the like is performed by referring to a synonym dictionary in which such a synonym set containing inappropriate terms is registered, appropriate results may not be obtained.

Accordingly, the problem to be solved by the present invention is to provide a document processing apparatus and a program capable of extracting terms that are appropriate as synonyms from documents.

The document processing apparatus according to the present embodiment includes document storage means, term extraction means, similarity calculation means, cluster generation means, feature degree calculation means, and synonym extraction means.

The document storage means stores a plurality of documents containing terms each consisting of one or more words.

The term extraction means extracts a first term and a second term from the terms contained in the plurality of documents stored in the document storage means.

The cluster generation means generates clusters to which the plurality of documents stored in the document storage means respectively belong.

The feature degree calculation means calculates a feature degree of the first term with respect to a generated cluster on the basis of the appearance frequency of the extracted first term in the plurality of documents stored in the document storage means and in the documents belonging to that cluster. It likewise calculates a feature degree of the second term with respect to the cluster on the basis of the appearance frequency of the extracted second term in the plurality of documents stored in the document storage means and in the documents belonging to that cluster.

The synonym extraction means extracts the first and second terms as synonyms on the basis of the similarity calculated by the similarity calculation means and the feature degrees of the first term and the second term calculated by the feature degree calculation means.

FIG. 1 is a block diagram showing the hardware configuration of a document processing apparatus according to a first embodiment.
FIG. 2 is a block diagram mainly showing the functional configuration of the document processing apparatus 30 shown in FIG. 1.
FIG. 3 is a diagram showing an example of the data structure of the document database 22 shown in FIG. 2.
FIG. 4 is a flowchart showing the processing procedure of the document processing apparatus 30 according to the embodiment.
FIG. 5 is a diagram showing an example of the data structure of the analysis result storage unit 23.
FIG. 6 is a diagram showing an example of the data structure of the analysis result storage unit 23.
FIG. 7 is a diagram showing an example of the data structure of the term aggregation result storage unit 24.
FIG. 8 is a diagram for explaining the clusters generated by the cluster generation unit 35.
FIG. 9 is a flowchart showing the processing procedure of the similarity calculation process.
FIG. 10 is a diagram showing an example of the data structure of the intermediate processing result information generated by the similarity calculation unit 34.
FIG. 11 is a diagram showing an example of the data structure of the similarity calculation result storage unit 25 after the pairs of term A and term B are stored.
FIG. 12 is a diagram showing an example of the data structure of the similarity calculation result storage unit 25 after the appearance frequency of term A is stored.
FIG. 13 is a diagram showing an example of the data structure of the similarity calculation result storage unit 25 after the context similarity of term A and term B is stored.
FIG. 14 is a diagram showing an example of the data structure of the similarity calculation result storage unit 25 after the character string similarity of term A and term B is stored.
FIG. 15 is a flowchart showing the processing procedure of the feature degree calculation process.
FIG. 16 is a diagram showing an example of the data structure of the feature degree calculation result storage unit 26 after the cluster appearance frequencies in clusters 1 to 3 are stored for each term.
FIG. 17 is a diagram showing an example of the data structure of the feature degree calculation result storage unit 26 after the feature degrees of each term with respect to clusters 1 to 3 are stored.
FIG. 18 is a diagram for explaining the terms that are characteristic of clusters 1 to 3.
FIG. 19 is a flowchart showing the processing procedure of the synonym set extraction process.
FIG. 20 is a diagram showing an example of the data structure of the synonym set storage unit 27 after the similarity calculation results are stored.
FIG. 21 is a diagram showing an example of the data structure of the synonym set storage unit 27 after the feature degree calculation results are stored.
FIG. 22 is a diagram showing the determination results of the synonym set extraction unit 37.
FIG. 23 is a diagram for explaining the synonym set extraction process in the case of cluster 1.
FIG. 24 is a diagram for explaining the synonym set extraction process in the case of cluster 3.
FIG. 25 is a diagram showing an example of the display screen when two terms extracted as synonyms by the synonym set extraction unit 37 are displayed.
FIG. 26 is a flowchart showing the processing procedure of the document processing apparatus 30 according to a second embodiment.
FIG. 27 is a flowchart showing the processing procedure of the similarity calculation process.
FIG. 28 is a diagram showing an example of the data structure of the intermediate processing result information generated by the similarity calculation unit 34.
FIG. 29 is a diagram showing an example of the data structure of the similarity calculation result storage unit 25 after the process has been executed for each of clusters 1 to 3.
FIG. 30 is a flowchart showing the processing procedure of the synonym set extraction process.

Hereinafter, embodiments will be described with reference to the drawings.

(First Embodiment)
First, a first embodiment will be described with reference to FIGS. 1 and 2. FIG. 1 is a block diagram showing the hardware configuration of the document processing apparatus according to the present embodiment. As shown in FIG. 1, a computer 10 is connected to an external storage device 20 such as a hard disk drive (HDD). The external storage device 20 stores a program 21 that is executed by the computer 10. The computer 10 and the external storage device 20 constitute a document processing apparatus 30.

FIG. 2 is a block diagram mainly showing the functional configuration of the document processing apparatus 30 shown in FIG. 1. As shown in FIG. 2, the document processing apparatus 30 includes an input processing unit 31, an analysis unit 32, a term aggregation unit 33, a similarity calculation unit 34, a cluster generation unit 35, a feature degree calculation unit 36, a synonym set extraction unit 37, and an output processing unit 38. In the present embodiment, these units 31 to 38 are realized by the computer 10 shown in FIG. 1 executing the program 21 stored in the external storage device 20. The program 21 can be stored in advance on a computer-readable storage medium and distributed. The program 21 may also be downloaded to the computer 10 via a network, for example.

The document processing apparatus 30 also includes a document database (DB) 22, an analysis result storage unit 23, a term aggregation result storage unit 24, a similarity calculation result storage unit 25, a feature degree calculation result storage unit 26, and a synonym set storage unit 27. In the present embodiment, the document database 22 and the storage units 23 to 27 are stored, for example, in the external storage device 20.

The document database 22 stores in advance a plurality of documents to be processed by the document processing apparatus 30. Each document stored in the document database 22 contains terms consisting of one or more words.

The input processing unit 31 processes instruction input from the user, data input from outside, and the like. For example, the input processing unit 31 accepts an instruction to execute the processing of the document processing apparatus 30 in response to a user operation.

The analysis unit 32 analyzes the plurality of documents stored in the document database 22 (for example, by morphological analysis and syntactic analysis) and thereby obtains analysis results for those documents. The analysis results include the terms contained in the documents and the dependency relationships between those terms. The analysis results obtained by the analysis unit 32 are stored in the analysis result storage unit 23.

Based on the analysis results stored in the analysis result storage unit 23, the term aggregation unit 33 counts, for each term included in the analysis results, its appearance frequency (the appearance frequency in the plurality of documents stored in the document database 22). The term aggregation unit 33 thereby obtains term aggregation results containing the appearance frequency of each term. The term aggregation results obtained by the term aggregation unit 33 are stored in the term aggregation result storage unit 24.

Based on the analysis results stored in the analysis result storage unit 23, the similarity calculation unit 34 extracts two terms (a first term and a second term) from the terms included in the analysis results (that is, the terms contained in the plurality of documents stored in the document database 22). In this case, the similarity calculation unit 34 extracts, for example, two terms (terms whose part of speech is noun) that have the same dependency relationship with the same term (a term whose part of speech is verb). The similarity calculation unit 34 then calculates the similarity between the two extracted terms.
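The extraction rule above (two nouns that depend on the same verb through the same relation) can be sketched as a grouping step. The tuple layout, the function name `candidate_pairs`, and the sample row for "消しゴム" are assumptions made for illustration; only the grouping idea comes from the description.

```python
from collections import defaultdict
from itertools import combinations

# (noun, verb, relation) triples. The "消しゴム" row is hypothetical sample data.
relations = [
    ("消しゴム", "消す", "で"),
    ("字消し", "消す", "で"),
    ("鉛筆", "持参", "を"),
]


def candidate_pairs(relations):
    """Return noun pairs that share the same (verb, relation) context."""
    nouns_by_context = defaultdict(set)
    for noun, verb, relation in relations:
        nouns_by_context[(verb, relation)].add(noun)
    pairs = set()
    for nouns in nouns_by_context.values():
        # Sorting gives each unordered pair a single canonical form.
        pairs.update(combinations(sorted(nouns), 2))
    return pairs


pairs = candidate_pairs(relations)
```

Here only "字消し" and "消しゴム" share a context ("消す" with the relation "で"), so they form the single candidate pair; "鉛筆" has no partner.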

The similarity calculation unit 34 includes a context similarity calculation unit 341 and a character string similarity calculation unit 342.

The context similarity calculation unit 341 calculates a context similarity as the similarity between the two terms extracted by the similarity calculation unit 34, based on the analysis results stored in the analysis result storage unit 23 and the term aggregation results stored in the term aggregation result storage unit 24. The context similarity is a similarity calculated on the basis of the distributional hypothesis that "semantically similar words appear in similar contexts."
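As a rough illustration of the distributional hypothesis, the sketch below scores two nouns by the overlap of their (verb, relation) contexts. The Jaccard coefficient is a stand-in chosen for brevity, not the formula of the embodiment, and it ignores the appearance frequencies that unit 341 is said to use. The names `contexts` and `context_similarity` and the sample rows are hypothetical.

```python
# (noun, verb, relation) triples; hypothetical sample data.
relations = [
    ("消しゴム", "消す", "で"),
    ("消しゴム", "持参", "を"),
    ("字消し", "消す", "で"),
    ("時計", "持参", "を"),
]


def contexts(term, relations):
    """Collect the (verb, relation) contexts in which a noun appears."""
    return {(verb, rel) for noun, verb, rel in relations if noun == term}


def context_similarity(a, b, relations):
    """Jaccard overlap of two nouns' context sets (assumed measure)."""
    ctx_a, ctx_b = contexts(a, relations), contexts(b, relations)
    union = ctx_a | ctx_b
    return len(ctx_a & ctx_b) / len(union) if union else 0.0


score = context_similarity("消しゴム", "字消し", relations)
```

With this sample data the two eraser terms share one of two distinct contexts, while "字消し" and "時計" share none.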

The character string similarity calculation unit 342 calculates a character string similarity as the similarity between the two terms extracted by the similarity calculation unit 34. The character string similarity is the similarity of the character strings that make up the terms themselves.

The similarities calculated by the similarity calculation unit 34 (the context similarity calculated by the context similarity calculation unit 341 and the character string similarity calculated by the character string similarity calculation unit 342) are stored in the similarity calculation result storage unit 25.

The cluster generation unit 35 generates clusters to which the plurality of documents stored in the document database 22 belong.
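This passage does not state how the clusters are formed. One simple possibility, assumed here purely for illustration, is to group documents by the classification code that each document carries. The function name `clusters_by_code` and the sample rows are hypothetical.

```python
from collections import defaultdict

# (document ID, classification code); hypothetical sample data.
documents = [(1, "A"), (2, "A"), (3, "B")]


def clusters_by_code(documents):
    """Group document IDs by classification code (assumed clustering basis)."""
    clusters = defaultdict(list)
    for doc_id, code in documents:
        clusters[code].append(doc_id)
    return dict(clusters)


clusters = clusters_by_code(documents)
```

Any other clustering method could be substituted, as long as it assigns each document to a cluster.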

Based on the term aggregation results stored in the term aggregation result storage unit 24, the feature degree calculation unit 36 calculates the feature degree of each term included in the term aggregation results. The feature degree calculation unit 36 calculates the feature degree for each cluster generated by the cluster generation unit 35. The calculated feature degrees are stored in the feature degree calculation result storage unit 26.
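The feature degree is described as depending on a term's appearance frequency both in all documents and in the documents of a cluster. As a minimal sketch, the ratio below (the share of a term's occurrences that fall inside the cluster) is one assumed way to combine the two counts; it is not presented as the formula of the embodiment, and all counts are hypothetical.

```python
def feature_degree(cluster_freq: int, total_freq: int) -> float:
    """Assumed score: fraction of a term's occurrences found in the cluster."""
    return cluster_freq / total_freq if total_freq else 0.0


# Hypothetical counts: term -> (occurrences in cluster 1, occurrences overall).
counts = {"消しゴム": (8, 10), "字消し": (3, 4), "取り消し": (1, 20)}
degrees = {term: feature_degree(c, t) for term, (c, t) in counts.items()}
```

With these made-up counts, the two eraser terms are concentrated in the cluster while "取り消し" is spread across the whole collection, so its score for the cluster is low.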

Based on the similarities stored in the similarity calculation result storage unit 25 and the feature degrees stored in the feature degree calculation result storage unit 26, the synonym set extraction unit 37 extracts the two terms extracted by the similarity calculation unit 34 as synonyms. The processing results of the synonym set extraction unit 37 are stored in the synonym set storage unit 27.
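How the similarity and the feature degrees are combined is not specified in this passage. The sketch below assumes a simple rule: a pair is accepted only when its similarity and both terms' feature degrees for the same cluster reach a threshold. The function name `is_synonym_pair` and both threshold values are hypothetical.

```python
def is_synonym_pair(similarity: float, feature_a: float, feature_b: float,
                    min_similarity: float = 0.5, min_feature: float = 0.5) -> bool:
    """Accept a pair only if it is similar AND both terms characterize the cluster."""
    return (similarity >= min_similarity
            and feature_a >= min_feature
            and feature_b >= min_feature)


# Hypothetical scores: both pairs are equally similar as strings,
# but only the first pair is characteristic of the same cluster.
accepted = is_synonym_pair(0.57, 0.8, 0.75)   # e.g. "消しゴム" / "字消し"
rejected = is_synonym_pair(0.57, 0.75, 0.05)  # e.g. "字消し" / "取り消し"
```

The point of the extra condition is that cluster-level information can reject a pair that a local similarity measure alone would accept.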

The output processing unit 38 processes display output to the user, data output to the outside, and the like. For example, the output processing unit 38 outputs the synonym set extracted by the synonym set extraction unit 37.

FIG. 3 shows an example of the data structure of the document database 22 shown in FIG. 2. The document database 22 shown in FIG. 3 stores a plurality of documents including a document 221.

Each document stored in the document database 22 contains a document ID, text, a date, a creator ID, and a classification code in association with one another.

The document ID is an identifier for identifying a document. The text indicates the content of the document identified by the associated document ID and contains, for example, terms consisting of one or more words. Terms consisting of a plurality of words include, for example, compound words.

The date indicates the date on which the document identified by the associated document ID was created or updated. The creator ID is an identifier for identifying the creator of the document identified by the associated document ID.

The classification code indicates the classification to which the document identified by the associated document ID belongs. The classification code is set in advance, for example, when the creator of the document (that is, the creator identified by the associated creator ID) registers the document in the document database 22. The classification code may instead be determined mechanically, for example by executing automatic clustering on the plurality of documents stored in the document database 22.

In the example shown in FIG. 3, the document 221 contains the document ID "1", the text "鉛筆と消しゴムと時計を持参して下さい。" ("Please bring a pencil, an eraser, and a watch."), the date "2010-1-1", the creator ID "1", and the classification code "A". This indicates that the document 221 is identified by the document ID "1" and that its content is "鉛筆と消しゴムと時計を持参して下さい。". It also indicates that the document 221 was created on the date "2010-1-1" by the creator identified by the creator ID "1" and that it belongs to the classification indicated by the classification code "A".

Only the document 221 among the plurality of documents stored in the document database 22 has been described here. The same applies to the other documents, so a detailed description of them is omitted.

In the example shown in FIG. 3, the text of each document consists of a single sentence for convenience, but the text may consist of a plurality of sentences (two or more).
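The record layout described for FIG. 3 can be written down as a small data class. The class name `Document` and the field names are assumptions chosen for readability; the field values are those given for the document 221.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    """One row of the document database 22 (field names are illustrative)."""
    doc_id: int
    text: str
    date: str                 # kept as written in the example, not parsed
    creator_id: int
    classification_code: str


doc_221 = Document(
    doc_id=1,
    text="鉛筆と消しゴムと時計を持参して下さい。",
    date="2010-1-1",
    creator_id=1,
    classification_code="A",
)
```

The date is stored as a plain string because the example value "2010-1-1" is not zero-padded and the description does not fix a date format.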

Next, the processing procedure of the document processing apparatus 30 according to the present embodiment will be described with reference to the flowchart of FIG. 4. In the following description, it is assumed that the document database 22 stores the plurality of documents shown in FIG. 3.

First, in response to a user operation, the input processing unit 31 accepts an instruction from the user to execute the processing of the document processing apparatus 30 (hereinafter referred to as an execution instruction) (step S1).

When the execution instruction is accepted by the input processing unit 31, the analysis unit 32 acquires the plurality of documents stored in the document database 22 and analyzes them (step S2). At this time, the analysis unit 32 executes, for example, morphological analysis and syntactic analysis. From the dependency tree structure representing the syntactic analysis result, the analysis unit 32 thereby obtains, for example, information indicating the dependency relationships between nouns and verbs (hereinafter referred to as dependency relationship information).

The dependency relationship information obtained by the analysis unit 32 is stored in the analysis result storage unit 23 as the analysis result.

FIGS. 5 and 6 show an example of the data structure of the analysis result storage unit 23. The analysis result storage unit 23 shown in FIGS. 5 and 6 stores a plurality of pieces of dependency relationship information, including dependency relationship information 231 to 233. As described above, dependency relationship information is information indicating a dependency relationship between a noun and a verb.

Each piece of dependency relationship information stored in the analysis result storage unit 23 contains, in association with one another, a dependency relationship information ID for identifying that piece of information, a term 1, a term 2, a relation, and a document ID. A dependency relationship information ID is assigned to each piece of dependency relationship information obtained by the analysis unit 32.

Term 1 indicates the source (dependent) term, that is, the noun, of the noun-verb dependency relationship indicated by the dependency relationship information identified by the associated dependency relationship information ID. Term 2 indicates the destination (head) term, that is, the verb, of that dependency relationship. The relation indicates the dependency relationship between the noun and the verb (that is, between term 1 and term 2). The document ID is an identifier for identifying the document from which the dependency relationship information was obtained (that is, the document in which the noun-verb dependency relationship appears).

図５に示す例えば係り受け関係情報２３１には、係り受け関係情報ＩＤ「１」、用語１「鉛筆」、用語２「持参」、関係「を」および文書ＩＤ「１」が含まれる。この係り受け関係情報ＩＤ「１」によって示される係り受け関係情報２３１によれば、用語１「鉛筆（名詞）」と用語２「持参（動詞）」との係り受け関係が「を」であることが示されている。また、係り受け関係情報ＩＤ「１」によって示される係り受け関係情報２３１によれば、当該係り受け関係情報２３１が文書データベース２２に格納されている複数の文書のうちの文書ＩＤ「１」によって識別される文書から取得されたことが示されている。 For example, the dependency relationship information 231 illustrated in FIG. 5 includes dependency relationship information ID “1”, term 1 “pencil”, term 2 “bringing”, relationship “O”, and document ID “1”. According to the dependency relationship information 231 indicated by the dependency relationship information ID “1”, the dependency relationship between the term 1 “pencil (noun)” and the term 2 “bringing (verb)” is “O”. It is shown. Further, according to the dependency relationship information 231 indicated by the dependency relationship information ID “1”, the dependency relationship information 231 is identified by the document ID “1” among a plurality of documents stored in the document database 22. It is shown that it was obtained from the document

また、図５に示す例えば係り受け関係情報２３２には、係り受け関係情報ＩＤ「６」、用語１「質問」、用語２「ある」、関係「が」および文書ＩＤ「３」が含まれる。この係り受け関係情報ＩＤ「６」によって示される係り受け関係情報２３２によれば、用語１「質問（名詞）」と用語２「ある（動詞）」との係り受け関係が「が」であることが示されている。また、係り受け関係情報ＩＤ「６」によって示される係り受け関係情報２３２によれば、当該係り受け関係情報２３２が文書データベース２２に格納されている複数の文書のうちの文書ＩＤ「６」によって識別される文書から取得されたことが示されている。 Further, for example, the dependency relationship information 232 illustrated in FIG. 5 includes dependency relationship information ID “6”, term 1 “question”, term 2 “present”, relationship “ga”, and document ID “3”. According to the dependency relationship information 232 indicated by the dependency relationship information ID “6”, the dependency relationship between the term 1 “question (noun)” and the term 2 “presence (verb)” is “ga”. It is shown. Further, according to the dependency relationship information 232 indicated by the dependency relationship information ID “6”, the dependency relationship information 232 is identified by the document ID “6” of the plurality of documents stored in the document database 22. It is shown that it was obtained from the document

また、図６に示す例えば係り受け関係情報２３３には、係り受け関係情報ＩＤ「２１」、用語１「字消し」、用語２「消す」、関係「で」および文書ＩＤ「８」が含まれる。この係り受け関係情報ＩＤ「２１」によって示される係り受け関係情報２３３によれば、用語１「字消し（名詞）」と用語２「消す（動詞）」との係り受け関係が「で」であることが示されている。また、係り受け関係情報ＩＤ「２１」によって示される係り受け関係情報２３３によれば、当該係り受け関係情報２３３が文書データベース２２に格納されている複数の文書のうちの文書ＩＤ「２１」によって識別される文書から取得されたことが示されている。 6 includes, for example, dependency relationship information ID “21”, term 1 “erasure”, term 2 “erasure”, relationship “de”, and document ID “8”. . According to the dependency relationship information 233 indicated by the dependency relationship information ID “21”, the dependency relationship between the term 1 “erasing (noun)” and the term 2 “erasing (verb)” is “de”. It has been shown. Further, according to the dependency relationship information 233 indicated by the dependency relationship information ID “21”, the dependency relationship information 233 is identified by the document ID “21” of the plurality of documents stored in the document database 22. It is shown that it was obtained from the document

The dependency relationship information 231 to 233, among the plural pieces of dependency relationship information stored in the analysis result storage unit 23, has been described here. The other pieces of dependency relationship information have the same structure, so a detailed description of them is omitted.

Returning to FIG. 4, the term totaling unit 33 totals the appearance frequency of terms based on the dependency relationship information stored in the analysis result storage unit 23 (step S3). Specifically, for each term 1 included in the dependency relationship information stored in the analysis result storage unit 23, the term totaling unit 33 acquires appearance frequency information indicating how often that term 1 appears across all the dependency relationship information stored in the analysis result storage unit 23.

The appearance frequency information acquired by the term totaling unit 33 is stored in the term totaling result storage unit 24 as a term totaling result.
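The tally of step S3 can be sketched as follows. This is only an illustrative sketch: the `Dependency` record layout, the function name `tally_term_frequencies`, and the sample records are assumptions made for the example and are not taken from FIG. 5 to FIG. 7.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class Dependency(NamedTuple):
    """One piece of dependency relationship information (hypothetical layout)."""
    dep_id: int
    term1: str      # dependent term (noun)
    term2: str      # head term
    relation: str   # dependency relation, e.g. "wo", "ga", "de"
    doc_id: int


def tally_term_frequencies(records: Iterable[Dependency]) -> Counter:
    """Count how often each term 1 appears across all dependency records."""
    return Counter(record.term1 for record in records)


# Hypothetical sample records, not the contents of the figures.
records = [
    Dependency(1, "pencil", "bringing", "wo", 1),
    Dependency(2, "eraser", "bringing", "wo", 1),
    Dependency(3, "eraser", "erase", "de", 2),
]
frequencies = tally_term_frequencies(records)
```

Each entry of `frequencies` then corresponds to one piece of appearance frequency information (term and frequency); an ID could be attached when the entries are stored.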

FIG. 7 shows an example of the data structure of the term totaling result storage unit 24. The term totaling result storage unit 24 shown in FIG. 7 stores plural pieces of appearance frequency information, including appearance frequency information 241 and 242. Each piece of appearance frequency information indicates how often a term 1 included in the dependency relationship information appears across all the dependency relationship information stored in the analysis result storage unit 23 (that is, across the plurality of documents stored in the document database 22).

Each piece of appearance frequency information stored in the term totaling result storage unit 24 contains, in association with one another, an appearance frequency information ID for identifying that piece of information, a term, and an appearance frequency. An appearance frequency information ID is assigned to each piece of appearance frequency information acquired by the term totaling unit 33.

The term is the term whose appearance frequency is given by the appearance frequency information identified by the associated appearance frequency information ID; it is a term 1 (noun) included in the dependency relationship information stored in the analysis result storage unit 23. The appearance frequency is the number of times the associated term appears across all the dependency relationship information stored in the analysis result storage unit 23.

For example, the appearance frequency information 241 shown in FIG. 7 includes the appearance frequency information ID "1", the term "pencil", and the frequency "1". The appearance frequency information 241 identified by the appearance frequency information ID "1" thus indicates that the appearance frequency of the term "pencil" is 1.

Likewise, the appearance frequency information 242 shown in FIG. 7 includes the appearance frequency information ID "2", the term "eraser", and the frequency "4". The appearance frequency information 242 identified by the appearance frequency information ID "2" thus indicates that the appearance frequency of the term "eraser" is 4.

The appearance frequency information 241 and 242, among the plural pieces of appearance frequency information stored in the term totaling result storage unit 24, has been described here. The other pieces of appearance frequency information have the same structure, so a detailed description of them is omitted.

Returning to FIG. 4, the similarity calculation unit 34 executes a similarity calculation process with reference to the analysis result storage unit 23 and the term totaling result storage unit 24 (step S4). In the similarity calculation process, two terms 1 are extracted from among the terms 1 included in the dependency relationship information stored in the analysis result storage unit 23, and the similarity between those two terms 1 is calculated. Details of the similarity calculation process will be described later.

When the similarity calculation process has been executed, the similarity calculated by the similarity calculation unit 34 is stored in the similarity calculation result storage unit 25.

Next, the cluster generation unit 35 generates clusters to which the plurality of documents stored in the document database 22 belong (step S5). When the documents stored in the document database 22 contain classification codes, as described above, the cluster generation unit 35 generates the clusters based on those classification codes. Here, as shown in FIG. 8, it is assumed that the cluster generation unit 35 has generated cluster 1 as the cluster to which documents containing the classification code "A" belong, cluster 2 as the cluster to which documents containing the classification code "B" belong, and cluster 3 as the cluster to which documents containing the classification code "C" belong.

In the present embodiment, one cluster is generated for each classification code contained in the documents stored in the document database 22, in one-to-one correspondence. However, when a classification code has multiple digits, for example, one cluster may instead be generated for each distinct value of its upper N digits. In other words, a configuration in which a single cluster is generated for a plurality of classification codes is also acceptable.
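The grouping by classification code, including the upper-N-digits variant, might look like the sketch below. The function name `build_clusters`, the `prefix_len` parameter, and the sample codes such as "A01" are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Optional


def build_clusters(doc_codes: Dict[int, str],
                   prefix_len: Optional[int] = None) -> Dict[str, List[int]]:
    """Group document IDs by classification code.

    With prefix_len=None each distinct code yields its own cluster.
    With prefix_len=N only the upper N characters of the code are used,
    so several codes can share one cluster.
    """
    clusters: Dict[str, List[int]] = defaultdict(list)
    for doc_id, code in doc_codes.items():
        key = code if prefix_len is None else code[:prefix_len]
        clusters[key].append(doc_id)
    return dict(clusters)


# Hypothetical multi-digit classification codes.
doc_codes = {1: "A01", 2: "A02", 3: "B01"}
one_to_one = build_clusters(doc_codes)              # three clusters
by_prefix = build_clusters(doc_codes, prefix_len=1)  # two clusters
```

Choosing a smaller prefix length trades cluster granularity for more documents per cluster.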

The present embodiment has been described on the assumption that each document stored in the document database 22 contains a classification code. When the documents do not contain classification codes, however, an automatic clustering process, for example, may be executed on the plurality of documents stored in the document database 22, and the clusters may be generated from the result of that process.

Next, the feature degree calculation unit 36 executes a process (hereinafter referred to as the feature degree calculation process) that calculates, for each term included in the appearance frequency information stored in the term totaling result storage unit 24, a feature degree with respect to each cluster generated by the cluster generation unit 35 (step S6). In the feature degree calculation process, the feature degree of a term with respect to a cluster is calculated from two values: the appearance frequency of the term included in the appearance frequency information stored in the term totaling result storage unit 24 (that is, the appearance frequency associated with the term in that information), and the appearance frequency of the term in the documents belonging to the cluster. Details of the feature degree calculation process will be described later.

When the feature degree calculation process has been executed, the feature degree calculated by the feature degree calculation unit 36 is stored in the feature degree calculation result storage unit 26.

Next, the synonym set extraction unit 37 executes a process (hereinafter referred to as the synonym set extraction process) that extracts synonyms (a set of synonyms) based on the similarity stored in the similarity calculation result storage unit 25 and the feature degree stored in the feature degree calculation result storage unit 26 (step S7). In the synonym set extraction process, the two terms extracted in the similarity calculation process described above are extracted as synonyms. Details of the synonym set extraction process will be described later.

When the synonym set extraction process has been executed, the processing result of the synonym set extraction unit 37 (the two terms extracted as synonyms by the synonym set extraction unit 37) is stored in the synonym set storage unit 27 and is also output via the output processing unit 38 (step S8).

The two terms stored in the synonym set storage unit 27 can be used as synonyms in processing such as document search or document classification. In addition, because the two terms extracted by the synonym set extraction unit 37 are output, the user can check whether the two terms are appropriate as synonyms and, for example, indicate whether they should be registered as synonyms.

Next, the processing procedure of the similarity calculation process described above (the process of step S4 shown in FIG. 4) will be described with reference to the flowchart of FIG. 9.

First, the similarity calculation unit 34 reads all the dependency relationship information stored in the analysis result storage unit 23 (step S11).

Next, the similarity calculation unit 34 treats term 2 (the head term) and the relation (the dependency relation) included in the dependency relationship information read from the analysis result storage unit 23 as one pair, and counts the number of distinct terms 1 (dependent terms) associated with that pair of term 2 and relation across all the dependency relationship information that was read (step S12). In other words, among the terms 1 included in the dependency relationship information read from the analysis result storage unit 23, the similarity calculation unit 34 counts the number of distinct terms 1 that stand in the same dependency relation to the same term 2. The process of step S12 is executed for every pair of term 2 and relation included in the dependency relationship information read from the analysis result storage unit 23.

Next, with reference to the analysis result storage unit 23, the similarity calculation unit 34 counts the appearance frequency of each term 1 associated with a pair of term 2 and relation for which the number of distinct terms counted in step S12 is 2 or more (step S13). Specifically, when the number of distinct terms 1 associated with a pair of term 2 and relation included in the dependency relationship information is 2 or more, the similarity calculation unit 34 counts, for each such term 1, the number of pieces of dependency relationship information that contain that term 1, that term 2, and that relation.

Based on the results of steps S12 and S13, the similarity calculation unit 34 generates intermediate processing result information indicating an intermediate result of the similarity calculation process (step S14). A piece of intermediate processing result information is generated for each combination of term 1, term 2, and relation whose appearance frequency was counted in step S13. Each piece of intermediate processing result information generated by the similarity calculation unit 34 contains an intermediate processing result ID for identifying that piece of information, the combination of term 1, term 2, and relation included in the dependency relationship information whose appearance frequency was counted in step S13, and the counted appearance frequency (hereinafter referred to as the dependency relationship appearance frequency).
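Steps S12 to S14 can be sketched as below. This is a minimal sketch under stated assumptions: records are reduced to hypothetical `(term1, term2, relation)` triples, the function name `intermediate_results` is invented for the example, and the intermediate processing result ID is left out.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple

Triple = Tuple[str, str, str]  # (term1, term2, relation), hypothetical layout


def intermediate_results(records: Iterable[Triple]) -> Dict[Triple, int]:
    """Return the dependency relationship appearance frequency for every
    triple whose (term2, relation) pair has at least two distinct term 1s."""
    # Number of records containing each exact triple (the S13 count).
    triple_counts = Counter(records)

    # Distinct term 1s per (term2, relation) pair (the S12 count).
    dependents = defaultdict(set)
    for term1, term2, relation in triple_counts:
        dependents[(term2, relation)].add(term1)

    # Keep only pairs shared by two or more distinct term 1s (S14 input).
    return {
        (term1, term2, relation): count
        for (term1, term2, relation), count in triple_counts.items()
        if len(dependents[(term2, relation)]) >= 2
    }


# Hypothetical sample: three nouns share "bringing"/"wo", one noun stands alone.
sample = [
    ("pencil", "bringing", "wo"),
    ("eraser", "bringing", "wo"),
    ("clock", "bringing", "wo"),
    ("question", "exist", "ga"),
]
result = intermediate_results(sample)
```

The lone triple ("question", "exist", "ga") is dropped because no second term 1 shares its head term and relation, so it cannot contribute a candidate pair.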

The similarity calculation unit 34 extracts all combinations of two terms 1 based on the generated intermediate processing result information (step S15). Specifically, based on the generated intermediate processing result information, the similarity calculation unit 34 extracts every combination (here, every permutation, that is, every ordered pair) of two terms 1 from among the terms 1 that stand in the same dependency relation to the same term 2.

Hereinafter, in each combination of two terms 1 extracted in step S15, the first term 1 is called term A and the second term 1 is called term B. The combination of term A and term B is simply referred to as the pair of term A and term B.

When step S15 has been executed, the similarity calculation unit 34 stores each extracted pair of term A and term B in the similarity calculation result storage unit 25. At this time, in addition to term A and term B, the similarity calculation result storage unit 25 stores items such as the dependency relationship appearance frequency that the intermediate processing result information holds in association with term A.
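The extraction of ordered pairs can be sketched with `itertools.permutations`. The row layout (`term_a`, `term_b`, `dep_freq`) and the function name `ordered_pairs` are assumptions for illustration; the actual columns of the similarity calculation result storage unit 25 are those shown in FIG. 11.

```python
from collections import defaultdict
from itertools import permutations
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (term1, term2, relation), hypothetical layout


def ordered_pairs(intermediate: Dict[Triple, int]) -> List[dict]:
    """Build every ordered (term A, term B) pair among the term 1s that
    share the same (term2, relation), keeping term A's dependency frequency."""
    groups = defaultdict(list)
    for (term1, term2, relation), freq in intermediate.items():
        groups[(term2, relation)].append((term1, freq))

    rows = []
    for (term2, relation), members in groups.items():
        # permutations(..., 2) yields both (A, B) and (B, A).
        for (term_a, freq_a), (term_b, _) in permutations(members, 2):
            rows.append({
                "term_a": term_a,
                "term_b": term_b,
                "term2": term2,
                "relation": relation,
                "dep_freq": freq_a,  # frequency associated with term A
            })
    return rows


intermediate = {
    ("pencil", "bringing", "wo"): 1,
    ("eraser", "bringing", "wo"): 1,
    ("clock", "bringing", "wo"): 1,
}
pairs = ordered_pairs(intermediate)  # 3 * 2 = 6 ordered pairs
```

Ordered pairs are used rather than unordered ones because the similarity formulas that follow divide by a quantity belonging to term A, so the score of (A, B) can differ from that of (B, A).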

Next, in steps S16 and S17, the context similarity of each term A and term B stored in the similarity calculation result storage unit 25 is calculated.

The context similarity calculation unit 341 included in the similarity calculation unit 34 acquires the appearance frequency of term A based on the appearance frequency information stored in the term totaling result storage unit 24 (step S16). That is, the context similarity calculation unit 341 acquires the appearance frequency associated with term A in the appearance frequency information stored in the term totaling result storage unit 24. The appearance frequency of term A acquired by the context similarity calculation unit 341 is stored in the similarity calculation result storage unit 25 in association with term A.

The process of step S16 is executed for each term A stored in the similarity calculation result storage unit 25. As a result, once step S16 has been executed, the appearance frequency of each term A is stored in the similarity calculation result storage unit 25 in association with that term A.

Next, the context similarity calculation unit 341 refers to the similarity calculation result storage unit 25 and calculates the context similarity of term A and term B stored there (step S17). The context similarity of term A and term B is calculated as "dependency relationship appearance frequency / appearance frequency of term A", using the dependency relationship appearance frequency and the appearance frequency of term A stored in the similarity calculation result storage unit 25 in association with term A and term B. When the context similarity calculated by the context similarity calculation unit 341 is equal to or less than a threshold (for example, 0.25), the context similarity is treated as 0.

The context similarity of term A and term B calculated by the context similarity calculation unit 341 is stored in the similarity calculation result storage unit 25 in association with term A and term B.

The process of step S17 is executed for each pair of term A and term B stored in the similarity calculation result storage unit 25. As a result, once step S17 has been executed, the context similarity of every pair of term A and term B is stored in the similarity calculation result storage unit 25.
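The context similarity of step S17 reduces to one division and a threshold check, as in the sketch below. The function name `context_similarity` and its parameter names are hypothetical; the formula and the example threshold of 0.25 come from the description above.

```python
def context_similarity(dep_freq: int, term_a_freq: int,
                       threshold: float = 0.25) -> float:
    """Dependency relationship appearance frequency divided by the
    appearance frequency of term A; values at or below the threshold
    are treated as 0."""
    if term_a_freq <= 0:
        raise ValueError("term A must appear at least once")
    score = dep_freq / term_a_freq
    return 0.0 if score <= threshold else score


# With the frequencies of FIG. 7 ("pencil": 1, "eraser": 4) and a
# dependency relationship appearance frequency of 1:
pencil_as_a = context_similarity(1, 1)  # 1 / 1 = 1.0
eraser_as_a = context_similarity(1, 4)  # 1 / 4 = 0.25, cut to 0.0
```

Because the denominator belongs to term A, a frequent term A that shares only a small part of its contexts with term B scores low, even if term B appears exclusively in the shared context.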

Next, in steps S18 and S19, the character string similarity of each term A and term B stored in the similarity calculation result storage unit 25 is calculated.

The character string similarity calculation unit 342 included in the similarity calculation unit 34 counts the number of characters that term A and term B stored in the similarity calculation result storage unit 25 have in common (hereinafter referred to as the number of common characters of term A and term B) (step S18).

Next, the character string similarity calculation unit 342 calculates the character string similarity of term A and term B based on the counted number of common characters (step S19). The character string similarity of term A and term B is calculated as "number of common characters of term A and term B / number of characters in term A". The character string similarity calculated by the character string similarity calculation unit 342 is stored in the similarity calculation result storage unit 25 in association with term A and term B.

The processes of steps S18 and S19 are executed for each pair of term A and term B stored in the similarity calculation result storage unit 25. As a result, once steps S18 and S19 have been executed, the character string similarity of every pair of term A and term B is stored in the similarity calculation result storage unit 25.
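Steps S18 and S19 can be sketched as follows. The description does not say how repeated characters are counted, so this sketch assumes a multiset intersection (each character is matched at most as many times as it occurs in both terms); the function names are likewise hypothetical.

```python
from collections import Counter


def common_char_count(term_a: str, term_b: str) -> int:
    """Number of characters shared by the two terms.

    Assumption: repeated characters are matched as a multiset intersection.
    """
    return sum((Counter(term_a) & Counter(term_b)).values())


def string_similarity(term_a: str, term_b: str) -> float:
    """Common characters divided by the number of characters in term A."""
    if not term_a:
        raise ValueError("term A must not be empty")
    return common_char_count(term_a, term_b) / len(term_a)


# "消しゴム" and "字消し" share the two characters "消" and "し".
a_to_b = string_similarity("消しゴム", "字消し")  # 2 / 4
b_to_a = string_similarity("字消し", "消しゴム")  # 2 / 3
```

The two directions give different values because the denominator is the length of term A, which is one reason the pairs are handled as permutations rather than unordered combinations.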

The similarity calculation process described above will now be explained concretely with reference to FIGS. 10 to 14. Here, it is assumed that the analysis result storage unit 23 stores the plural pieces of dependency relationship information shown in FIG. 5 and FIG. 6, and that the term totaling result storage unit 24 stores the plural pieces of appearance frequency information shown in FIG. 7.

First, the similarity calculation unit 34 reads all the dependency relationship information from the analysis result storage unit 23. Next, for each pair of term 2 and relation included in the dependency relationship information that was read, the similarity calculation unit 34 counts the number of distinct terms 1 associated with that pair. That is, based on the dependency relationship information that was read, the similarity calculation unit 34 counts the number of distinct terms 1 that stand in the same dependency relation to the same term 2.

To explain concretely using FIG. 5 and FIG. 6, the terms 1 associated with the pair of term 2 "bringing" (持参) and the relation "wo", for example, are "pencil", "eraser", and "clock". Therefore, in the dependency relationship information shown in FIG. 5 and FIG. 6, the number of distinct terms 1 associated with the pair of term 2 "bringing" and the relation "wo" (that is, the terms 1 that stand in the dependency relation "wo" to term 2 "bringing") is 3. Although only this pair has been described concretely, the number of distinct terms 1 associated with every other pair of term 2 and relation is counted in the same way.

Next, the similarity calculation unit 34 counts the appearance frequency of each term 1 associated with a pair of term 2 and relation for which the counted number of distinct terms is 2 or more. For example, because the number of distinct terms 1 associated with the pair of term 2 "bringing" and the relation "wo" is 2 or more, the similarity calculation unit 34 counts the appearance frequency of each of the terms 1 "pencil", "eraser", and "clock" associated with that pair.

The appearance frequency of term 1 "pencil" associated with the pair of term 2 "bringing" and the relation "wo" will now be described concretely with reference to FIG. 5 and FIG. 6. In this case, the number of pieces of dependency relationship information in the analysis result storage unit 23 that contain the combination of term 1 "pencil", term 2 "bringing", and the relation "wo" is counted. According to FIG. 5 and FIG. 6, the appearance frequency of term 1 "pencil" associated with the pair of term 2 "bringing" and the relation "wo" is 1.

Although a detailed description is omitted, according to the dependency relationship information shown in FIG. 5 and FIG. 6, the appearance frequencies of the terms 1 "eraser" and "clock" associated with the pair of term 2 "bringing" and the relation "wo" are likewise 1 each.

The description here has focused on the appearance frequencies of the terms 1 "pencil", "eraser", and "clock" associated with the pair of term 2 "bringing" and the relation "wo". The appearance frequency is counted in the same way for every term 1 associated with a pair of term 2 and relation for which the number of distinct terms is 2 or more.

Next, the similarity calculation unit 34 generates intermediate processing result information. In this case, the similarity calculation unit 34 generates the intermediate processing result information 101 to 110 shown in FIG. 10.

As shown in FIG. 10, each piece of intermediate processing result information 101 to 110 contains an intermediate processing result ID, the combination of term 1, term 2, and relation included in the dependency relationship information whose appearance frequency was counted as described above, and the counted appearance frequency (the dependency relationship appearance frequency).

The intermediate processing result ID contained in each piece of intermediate processing result information 101 to 110 is an identifier for that piece of information. It consists of a numerical value (identifier) assigned to the associated pair of term 2 and relation, and a numerical value (identifier) assigned to the associated term 1.

For example, the intermediate processing result information 101 shown in FIG. 10 contains the intermediate processing result ID "1-1", term 1 "pencil", term 2 "bringing", the relation "wo", and the appearance frequency "1". The intermediate processing result information 101 identified by the intermediate processing result ID "1-1" indicates that the appearance frequency of dependency relationship information containing the combination of term 1 "pencil", term 2 "bringing", and the relation "wo" (the dependency relationship appearance frequency) is 1. In the intermediate processing result ID "1-1", the "1" on the left is the numerical value assigned to the pair of term 2 "bringing" and the relation "wo", and the "1" on the right is the numerical value assigned to term 1 "pencil".

Similarly, the intermediate processing result information 102 contains the intermediate processing result ID "1-2", term 1 "eraser", term 2 "bringing", the relation "wo", and the appearance frequency "1". The intermediate processing result information 102 identified by the intermediate processing result ID "1-2" indicates that the appearance frequency of dependency relationship information containing the combination of term 1 "eraser", term 2 "bringing", and the relation "wo" is 1. In the intermediate processing result ID "1-2", the "1" on the left is, as in the intermediate processing result information 101, the numerical value assigned to the pair of term 2 "bringing" and the relation "wo", and the "2" on the right is the numerical value assigned to term 1 "eraser".

Furthermore, the intermediate processing result information 103 contains the intermediate processing result ID "1-3", term 1 "clock", term 2 "bringing", the relation "wo", and the appearance frequency "1". The intermediate processing result information 103 identified by the intermediate processing result ID "1-3" indicates that the appearance frequency of dependency relationship information containing the combination of term 1 "clock", term 2 "bringing", and the relation "wo" is 1. In the intermediate processing result ID "1-3", the "1" on the left is, as in the intermediate processing result information 101 and 102, the numerical value assigned to the pair of term 2 "bringing" and the relation "wo", and the "3" on the right is the numerical value assigned to term 1 "clock".

A detailed description of the intermediate processing result information 104 to 110 is omitted. Note, however, that the intermediate processing result information 104 and 105 share the pair of term 2 "fill in" (記入) and the relation "wo", so the left-hand values of their intermediate processing result IDs are the same. Similarly, the intermediate processing result information 106 to 108 share the pair of term 2 "press" (押す) and the relation "wo", so the left-hand values of their intermediate processing result IDs are the same. The intermediate processing result information 109 and 110 likewise share the pair of term 2 "erase" (消す) and the relation "de", so the left-hand values of their intermediate processing result IDs are the same.
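One way to form such two-part IDs is sketched below. This is an assumption-laden sketch: the description only says that the left value is assigned to the pair of term 2 and relation and the right value to term 1, so numbering both in order of first appearance, and restarting the right value within each group, is a guess made for illustration. The function name `assign_ids` and the term 1 "name" in the second group are hypothetical.

```python
from typing import Dict, Iterable, Tuple

Triple = Tuple[str, str, str]  # (term1, term2, relation), hypothetical layout


def assign_ids(triples: Iterable[Triple]) -> Dict[Triple, str]:
    """Assign "<group>-<member>" IDs, where <group> numbers the
    (term2, relation) pair and <member> numbers term 1 within that group."""
    group_no: Dict[Tuple[str, str], int] = {}
    member_no: Dict[int, Dict[str, int]] = {}
    ids: Dict[Triple, str] = {}
    for term1, term2, relation in triples:
        # Left value: one number per (term2, relation), in order of appearance.
        group = group_no.setdefault((term2, relation), len(group_no) + 1)
        # Right value: one number per term 1 inside that group.
        members = member_no.setdefault(group, {})
        member = members.setdefault(term1, len(members) + 1)
        ids[(term1, term2, relation)] = f"{group}-{member}"
    return ids


ids = assign_ids([
    ("pencil", "bringing", "wo"),
    ("eraser", "bringing", "wo"),
    ("clock", "bringing", "wo"),
    ("name", "fill in", "wo"),  # hypothetical term 1 for the second group
])
```

With this scheme, records that share a head term and relation can be found by comparing only the left-hand part of the ID.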

Next, the similarity calculation unit 34 extracts, from the intermediate processing result information 101 to 110, every ordered pair (permutation) of two terms 1 drawn from the terms 1 associated with intermediate processing result IDs that have the same left-hand number.

For example, in the intermediate processing result information 101 to 110, the terms 1 associated with the intermediate processing result IDs whose left-hand number is “1” (here, the IDs “1-1”, “1-2”, and “1-3”) are “鉛筆” (pencil), “消しゴム” (eraser), and “時計” (clock). In this case, the similarity calculation unit 34 extracts six ordered pairs (permutations): “鉛筆” and “消しゴム”, “鉛筆” and “時計”, “消しゴム” and “鉛筆”, “消しゴム” and “時計”, “時計” and “鉛筆”, and “時計” and “消しゴム”. The same applies to the terms 1 associated with the intermediate processing result IDs whose left-hand number is, for example, “2” or “3”.
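The pair extraction described above can be sketched with Python's standard `itertools.permutations`. The function name `ordered_pairs_by_group` and the record layout are assumptions made for illustration. Only the group with left-hand number “1” is taken from the description.

```python
from collections import defaultdict
from itertools import permutations

# (intermediate processing result ID, term 1) for the group with left-hand number "1"
records = [("1-1", "鉛筆"), ("1-2", "消しゴム"), ("1-3", "時計")]


def ordered_pairs_by_group(records):
    """Return (original ID of term A, term A, term B) for every ordered pair
    of terms 1 whose IDs share the same left-hand number."""
    groups = defaultdict(list)
    for result_id, term1 in records:
        left = result_id.split("-")[0]
        groups[left].append((result_id, term1))

    pairs = []
    for members in groups.values():
        # permutations(..., 2) yields both (A, B) and (B, A)
        for (id_a, term_a), (_, term_b) in permutations(members, 2):
            pairs.append((id_a, term_a, term_b))
    return pairs


pairs = ordered_pairs_by_group(records)
print(len(pairs))  # 6 ordered pairs for three terms
```

Ordered pairs are used rather than unordered combinations because the similarities computed later are normalized by term A and are therefore not symmetric.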

このように類似度算出部３４によって抽出された用語１の各組み合わせは、用語Ａおよび用語Ｂとして類似度算出結果格納部２５に格納される。 Thus, each combination of the terms 1 extracted by the similarity calculation unit 34 is stored in the similarity calculation result storage unit 25 as the terms A and B.

ここで、図１１は、用語Ａおよび用語Ｂの組が格納された後の類似度算出結果格納部２５のデータ構造の一例を示す。 Here, FIG. 11 shows an example of the data structure of the similarity calculation result storage unit 25 after the combination of the term A and the term B is stored.

図１１に示すように、類似度算出結果格納部２５には、組ＩＤ、元ＩＤ、元ＩＤ出現頻度、用語Ａおよび用語Ｂが対応づけて格納されている。 As illustrated in FIG. 11, the similarity calculation result storage unit 25 stores a set ID, an original ID, an original ID appearance frequency, a term A, and a term B in association with each other.

The set ID is an identifier for identifying a pair of term A and term B. The original ID is the intermediate processing result ID that is included in the intermediate processing result information shown in FIG. 10 in association with term A. The original ID appearance frequency is the appearance frequency (dependency relation appearance frequency) that is included in the intermediate processing result information shown in FIG. 10 in association with term A.

In the example shown in FIG. 11, the similarity calculation result storage unit 25 stores, for example, the set ID “1”, the original ID “1-1”, the original ID appearance frequency “1”, term A “鉛筆” (pencil), and term B “消しゴム” (eraser) in association with one another.

ここでは、用語Ａ「鉛筆」および用語Ｂ「消しゴム」の組について説明したが、図１１に示すように他の用語Ａおよび用語Ｂの組についても同様である。 Here, the set of the term A “pencil” and the term B “eraser” has been described, but the same applies to other sets of the term A and the term B as shown in FIG.
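One row of the similarity calculation result storage unit 25 could be modeled as below. This is only a sketch: the class name `SimilarityRecord`, the field names, and the field types are assumptions. The optional fields stand for the values that the description adds to the same storage unit in later steps (FIGS. 12 to 14).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SimilarityRecord:
    set_id: int                 # identifies the (term A, term B) pair
    original_id: str            # intermediate processing result ID of term A
    original_id_frequency: int  # dependency relation appearance frequency
    term_a: str
    term_b: str
    term_a_frequency: Optional[int] = None      # added later (FIG. 12)
    context_similarity: Optional[float] = None  # added later (FIG. 13)
    string_similarity: Optional[float] = None   # added later (FIG. 14)


# The row quoted for FIG. 11
record = SimilarityRecord(1, "1-1", 1, "鉛筆", "消しゴム")
print(record)
```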

次に、類似度算出部３４に含まれる文脈類似度算出部３４１は、図１１に示す類似度算出結果格納部２５に格納された用語Ａおよび用語Ｂの文脈類似度を算出する。 Next, the context similarity calculation unit 341 included in the similarity calculation unit 34 calculates the context similarity of the terms A and B stored in the similarity calculation result storage unit 25 illustrated in FIG.

In this case, the context similarity calculation unit 341 acquires the appearance frequency associated with each term A (that is, the appearance frequency of the term A) in the appearance frequency information stored in the term aggregation result storage unit 24 shown in FIG. 7. As shown in FIG. 12, the appearance frequency of each term A acquired by the context similarity calculation unit 341 is stored in the similarity calculation result storage unit 25 in association with the term A (and term B).

次に、文脈類似度算出部３４１は、類似度算出結果格納部２５において用語Ａおよび用語Ｂに対応づけられている元ＩＤ出現頻度（係り受け関係出現頻度）および用語Ａの出現頻度を用いて、当該用語Ａおよび用語Ｂの文脈類似度を算出する。 Next, the context similarity calculation unit 341 uses the original ID appearance frequency (dependency relationship appearance frequency) and the appearance frequency of the term A that are associated with the terms A and B in the similarity calculation result storage unit 25. The context similarity of terms A and B is calculated.

In the similarity calculation result storage unit 25 shown in FIG. 12, for example, the original ID appearance frequency associated with term A “鉛筆” (pencil) and term B “消しゴム” (eraser) is 1, and the appearance frequency of term A “鉛筆” is 1. The context similarity of term A “鉛筆” and term B “消しゴム” is therefore calculated as 1/1 = 1.

Likewise, in the similarity calculation result storage unit 25 shown in FIG. 12, the original ID appearance frequency associated with term A “消しゴム” (eraser) and term B “鉛筆” (pencil) is 1, and the appearance frequency of term A “消しゴム” is 4. The context similarity of term A “消しゴム” and term B “鉛筆” is therefore calculated as 1/4 = 0.25. If the threshold described above is 0.25, this context similarity is equal to or less than the threshold and is therefore set to 0.

なお、図１２に示す類似度算出結果格納部２５に格納されている他の用語Ａおよび用語Ｂについても同様に文脈類似度が算出される。 Note that context similarities are similarly calculated for other terms A and B stored in the similarity calculation result storage unit 25 shown in FIG.
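The two worked values above can be reproduced with a short sketch. The function name `context_similarity` and its parameter names are assumptions. The sketch covers only the single-context case worked through in the text, and how contributions from several shared contexts are combined is not modeled here.

```python
def context_similarity(dependency_frequency, term_a_frequency, threshold=0.25):
    """Ratio of the dependency relation appearance frequency (original ID
    appearance frequency) to the appearance frequency of term A.
    A value equal to or below the threshold is replaced by 0."""
    if term_a_frequency <= 0:
        raise ValueError("term A must appear at least once")
    score = dependency_frequency / term_a_frequency
    return 0.0 if score <= threshold else score


print(context_similarity(1, 1))  # term A 鉛筆, term B 消しゴム
print(context_similarity(1, 4))  # term A 消しゴム, term B 鉛筆 (cut to 0 by the threshold)
```

Because the denominator is the frequency of term A alone, swapping term A and term B generally changes the result, which is why both orderings of each pair are stored.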

上記したように文脈類似度算出部３４１によって算出された用語Ａおよび用語Ｂの文脈類似度は、当該用語Ａおよび用語Ｂに対応づけて類似度算出結果格納部２５に格納される。なお、図１３は、用語Ａおよび用語Ｂの文脈類似度が格納された後の類似度算出結果格納部２５のデータ構造の一例を示す。 As described above, the context similarity of terms A and B calculated by the context similarity calculation unit 341 is stored in the similarity calculation result storage unit 25 in association with the terms A and B. FIG. 13 shows an example of the data structure of the similarity calculation result storage unit 25 after the context similarity of the terms A and B is stored.

次に、類似度算出部３４に含まれる文字列類似度算出部３４２は、類似度算出結果格納部２５に格納された用語Ａおよび用語Ｂの文字列類似度を算出する。 Next, the character string similarity calculation unit 342 included in the similarity calculation unit 34 calculates the character string similarity of terms A and B stored in the similarity calculation result storage unit 25.

In this case, the character string similarity calculation unit 342 counts the number of characters common to term A and term B for each pair of term A and term B stored in the similarity calculation result storage unit 25. Referring to FIG. 13, for term A “鉛筆” (pencil) and term B “消しゴム” (eraser), the number of common characters is 0. For term A “字消し” (jikeshi) and term B “消しゴム” (eraser), the number of common characters is 2 (“消” and “し”).

The character string similarity calculation unit 342 calculates the character string similarity of term A and term B from the counted number of common characters and the number of characters in term A. For term A “鉛筆” (pencil) and term B “消しゴム” (eraser), the number of common characters is 0 and term A has 2 characters, so the character string similarity is calculated as 0/2 = 0. For term A “字消し” (jikeshi) and term B “消しゴム”, the number of common characters is 2 and term A has 3 characters, so the character string similarity is calculated as 2/3 ≈ 0.67.

なお、図１３に示す類似度算出結果格納部２５に格納されている他の用語Ａおよび用語Ｂについても同様に文字列類似度が算出される。 The character string similarity is similarly calculated for the other terms A and B stored in the similarity calculation result storage unit 25 shown in FIG.
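A minimal sketch of this calculation follows. The description does not say how repeated characters are counted, so counting common characters as a multiset intersection (via `collections.Counter`) is an assumption, as is the function name `string_similarity`.

```python
from collections import Counter


def string_similarity(term_a, term_b):
    """Number of characters common to both terms divided by the number of
    characters in term A. Repeated characters are matched one-to-one
    (multiset intersection), which is an assumption of this sketch."""
    if not term_a:
        raise ValueError("term A must not be empty")
    common = sum((Counter(term_a) & Counter(term_b)).values())
    return common / len(term_a)


print(string_similarity("鉛筆", "消しゴム"))              # no common characters
print(round(string_similarity("字消し", "消しゴム"), 2))  # 消 and し are shared
```

As with the context similarity, the measure is normalized by term A only, so `string_similarity("消しゴム", "字消し")` gives 2/4 rather than 2/3.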

As described above, the character string similarity of term A and term B calculated by the character string similarity calculation unit 342 is stored in the similarity calculation result storage unit 25 in association with the term A and term B. FIG. 14 shows an example of the data structure of the similarity calculation result storage unit 25 after the character string similarities of term A and term B have been stored.

次に、図１５のフローチャートを参照して、前述した特徴度算出処理（上記した図４に示すステップＳ６の処理）の処理手順について説明する。 Next, with reference to the flowchart of FIG. 15, the processing procedure of the above-described feature degree calculation processing (the processing of step S6 shown in FIG. 4 described above) will be described.

First, the feature degree calculation unit 36 reads all the appearance frequency information stored in the term aggregation result storage unit 24 (step S21). The appearance frequency information read from the term aggregation result storage unit 24 is stored in the feature degree calculation result storage unit 26. As a result, the feature degree calculation result storage unit 26 stores the appearance frequency information ID, the term, and the appearance frequency of the term in association with one another.

次に、特徴度算出部３６は、上述した図４に示すステップＳ５においてクラスタ生成部３５によって生成されたクラスタ毎に、当該クラスタに属する文書（の集合）を文書データベース２２から取得する。 Next, for each cluster generated by the cluster generation unit 35 in step S5 shown in FIG. 4 described above, the feature calculation unit 36 acquires a document (set) belonging to the cluster from the document database 22.

The feature degree calculation unit 36 counts the appearance frequency, in each cluster, of the terms stored in the feature degree calculation result storage unit 26 (step S22). Specifically, the feature degree calculation unit 36 counts how often each term appears in the acquired documents belonging to each cluster. The process of step S22 is executed for all the terms stored in the feature degree calculation result storage unit 26.

When the process of step S22 is executed, the appearance frequency in each cluster counted by the feature degree calculation unit 36 for each term is stored in the feature degree calculation result storage unit 26 in association with that term.

Next, the feature degree calculation unit 36 calculates the feature degree of each term with respect to each cluster, based on the appearance frequency of the term stored in the feature degree calculation result storage unit 26 and the appearance frequency of the term in each cluster (step S23). The larger the feature degree of a term with respect to a cluster, the more characteristic the term is of that cluster.

用語のクラスタに対する特徴度は、「（当該用語のクラスタにおける出現頻度−１）／当該用語の出現頻度」によって算出される。なお、ステップＳ２３の処理は、特徴度算出結果格納部２６に格納された全ての用語について実行される。 The characteristic degree of a term with respect to a cluster is calculated by “(appearance frequency in the cluster of the term−1) / appearance frequency of the term”. Note that the process of step S23 is executed for all terms stored in the feature calculation result storage unit 26.

上記したステップＳ２３の処理が実行されると、特徴度算出部３６によって算出された用語の各クラスタに対する特徴度は、当該用語に対応づけて特徴度算出結果格納部２６に格納される（ステップＳ２４）。ステップＳ２４の処理が実行されると、特徴度算出処理は終了される。 When the process of step S23 described above is executed, the feature degrees for each cluster of terms calculated by the feature degree calculation unit 36 are stored in the feature degree calculation result storage unit 26 in association with the terms (step S24). ). When the process of step S24 is executed, the feature degree calculation process ends.
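The counting in step S22 can be sketched as follows. The documents below are hypothetical, already-segmented token lists invented for illustration, and counting every occurrence (rather than, say, one count per document) is an assumption, as is the function name `count_cluster_frequencies`.

```python
def count_cluster_frequencies(terms, clusters):
    """For each term, count its occurrences in the documents of each cluster.

    clusters: dict mapping cluster ID -> list of documents,
              each document being a list of tokens.
    """
    counts = {}
    for term in terms:
        counts[term] = {
            cluster_id: sum(doc.count(term) for doc in docs)
            for cluster_id, docs in clusters.items()
        }
    return counts


# Hypothetical documents, chosen only so that the counts are easy to check by hand
clusters = {
    1: [["消しゴム", "鉛筆"]],
    2: [["消しゴム", "字消し"], ["消しゴム", "消しゴム", "字消し"]],
    3: [["取り消し"]],
}
freq = count_cluster_frequencies(["消しゴム", "字消し"], clusters)
print(freq["消しゴム"])
```

The per-cluster counts produced this way are the inputs to the division in step S23.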

The feature degree calculation process described above is now explained concretely with reference to FIGS. 16 to 18. Here, it is assumed that the term aggregation result storage unit 24 stores the plural pieces of appearance frequency information shown in FIG. 7, and that clusters 1 to 3 have been generated by the cluster generation unit 35 as described with reference to FIG. 8.

まず、特徴度算出部３６は、用語集計結果格納部２４から全ての出現頻度情報を読み込む。用語集計結果格納部２４から読み込まれた出現頻度情報は、特徴度算出結果格納部２６に格納される。これにより、特徴度算出結果格納部２６は、例えば図７と同様の情報が格納される。 First, the feature degree calculation unit 36 reads all appearance frequency information from the term aggregation result storage unit 24. The appearance frequency information read from the term aggregation result storage unit 24 is stored in the feature degree calculation result storage unit 26. Thereby, the feature degree calculation result storage unit 26 stores, for example, the same information as in FIG.

Next, the feature degree calculation unit 36 acquires the documents belonging to each of clusters 1 to 3 generated by the cluster generation unit 35 from the document database 22. As described above, documents containing the classification code “A” belong to cluster 1, documents containing the classification code “B” belong to cluster 2, and documents containing the classification code “C” belong to cluster 3. Accordingly, documents containing the classification code “A” are acquired from the document database 22 as the documents belonging to cluster 1, documents containing the classification code “B” as those belonging to cluster 2, and documents containing the classification code “C” as those belonging to cluster 3.

特徴度算出部３６は、特徴度算出結果格納部２６に格納された用語毎に、当該用語のクラスタ「１」〜「３」の各々における出現頻度（以下、クラスタ出現頻度と表記）をカウントする。この場合、特徴度算出部３６は、文書データベース２２から取得されたクラスタ１に属する文書における各用語の出現頻度、クラスタ２に属する文書における各用語の出現頻度およびクラスタ３に属する文書における各用語の出現頻度をカウントする。 For each term stored in the feature value calculation result storage unit 26, the feature value calculation unit 36 counts the appearance frequency (hereinafter referred to as cluster appearance frequency) in each of the clusters “1” to “3” of the term. . In this case, the feature calculation unit 36 obtains the appearance frequency of each term in the document belonging to the cluster 1 acquired from the document database 22, the appearance frequency of each term in the document belonging to the cluster 2, and each term in the document belonging to the cluster 3. Count appearance frequency.

特徴度算出部３６によってカウントされたクラスタ１〜３の各々におけるクラスタ出現頻度は、用語毎に特徴度算出結果格納部２６に格納される。なお、図１６は、用語毎に各クラスタ１〜３におけるクラスタ出現頻度が格納された後の特徴度算出結果格納部２６のデータ構造の一例を示す。図１６においては、便宜的に、クラスタ１におけるクラスタ出現頻度をクラスタ出現頻度１、クラスタ２におけるクラスタ出現頻度をクラスタ出現頻度２、クラスタ３におけるクラスタ出現頻度をクラスタ出現頻度３として示す。図１６に示す例では、特徴度算出結果格納部２６には、例えば用語「消しゴム」のクラスタ１におけるクラスタ出現頻度として１、クラスタ２におけるクラスタ出現頻度として３、クラスタ３におけるクラスタ出現頻度として０が格納されている。ここでは詳しい説明を省略するが、特徴度算出結果格納部２６には、図１６に示すように他の用語についても同様にクラスタ１〜３の各々におけるクラスタ出現頻度が格納されている。 The cluster appearance frequency in each of the clusters 1 to 3 counted by the feature degree calculation unit 36 is stored in the feature degree calculation result storage unit 26 for each term. FIG. 16 shows an example of the data structure of the feature calculation result storage unit 26 after the cluster appearance frequency in each of the clusters 1 to 3 is stored for each term. In FIG. 16, for the sake of convenience, the cluster appearance frequency in cluster 1 is shown as cluster appearance frequency 1, the cluster appearance frequency in cluster 2 is shown as cluster appearance frequency 2, and the cluster appearance frequency in cluster 3 is shown as cluster appearance frequency 3. In the example illustrated in FIG. 16, the feature calculation result storage unit 26 has, for example, 1 as the cluster appearance frequency in cluster 1 of the term “eraser”, 3 as the cluster appearance frequency in cluster 2, and 0 as the cluster appearance frequency in cluster 3. Stored. Although detailed explanation is omitted here, the feature calculation result storage unit 26 stores the cluster appearance frequency in each of the clusters 1 to 3 similarly for other terms as shown in FIG.

次に、特徴度算出部３６は、特徴度算出結果格納部２６に格納された用語毎に、当該用語の出現頻度および当該用語のクラスタ１〜３の各々におけるクラスタ出現頻度に基づいて、当該用語のクラスタ１〜３の各々に対する特徴度を算出する。なお、用語のクラスタＮ（ここでは、Ｎ＝１，２，３）に対する特徴度は、「（当該用語のクラスタＮにおけるクラスタ出現頻度−１）／当該用語の出現頻度」によって算出される。 Next, for each term stored in the feature value calculation result storage unit 26, the feature value calculation unit 36 calculates the term based on the appearance frequency of the term and the cluster appearance frequency in each of the clusters 1 to 3 of the term. The feature degree for each of the clusters 1 to 3 is calculated. It should be noted that the degree of feature for a term cluster N (here, N = 1, 2, 3) is calculated by “(cluster appearance frequency in the cluster N of the term−1) / appearance frequency of the term”.

Here, with reference to FIG. 16, the feature degrees of, for example, the term “消しゴム” (eraser) with respect to each of clusters 1 to 3 are described concretely. As shown in FIG. 16, the appearance frequency of the term “消しゴム” stored in the feature degree calculation result storage unit 26 is 4, and its cluster appearance frequency in cluster 1 is 1. The feature degree of the term “消しゴム” with respect to cluster 1 is therefore calculated as (1 − 1)/4 = 0. The cluster appearance frequency of the term “消しゴム” in cluster 2 is 3, so its feature degree with respect to cluster 2 is calculated as (3 − 1)/4 = 0.5. The cluster appearance frequency of the term “消しゴム” in cluster 3 (cluster appearance frequency 3) is 0, and in this case its feature degree with respect to cluster 3 is calculated as 0.

The feature degrees of the other terms stored in the feature degree calculation result storage unit 26 shown in FIG. 16 are calculated in the same manner.
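The worked values for “消しゴム” can be checked with the sketch below. Applied literally, the formula gives (0 − 1)/4 = −0.25 for cluster 3, whereas the description states a feature degree of 0. Clamping negative results to 0 is therefore an assumption made here to reproduce the stated value, and the function name `feature_degree` is likewise hypothetical.

```python
def feature_degree(cluster_frequency, term_frequency):
    """(cluster appearance frequency - 1) / appearance frequency of the term.

    Negative results (cluster frequency 0) are clamped to 0; this clamp is an
    assumption, chosen so that the sketch matches the worked example."""
    if term_frequency <= 0:
        raise ValueError("the term must appear at least once")
    return max(0.0, (cluster_frequency - 1) / term_frequency)


# 消しゴム: appearance frequency 4, cluster appearance frequencies 1, 3, 0
degrees = [feature_degree(c, 4) for c in (1, 3, 0)]
print(degrees)  # [0.0, 0.5, 0.0]
```

Subtracting 1 from the numerator means that a single occurrence in a cluster never makes a term characteristic of that cluster.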

As described above, the feature degrees of each term with respect to clusters 1 to 3 calculated by the feature degree calculation unit 36 are stored in the feature degree calculation result storage unit 26 in association with the term. FIG. 17 shows an example of the data structure of the feature degree calculation result storage unit 26 after these feature degrees have been stored. In FIG. 17, for convenience, the feature degree with respect to cluster 1 is shown as feature degree 1, that with respect to cluster 2 as feature degree 2, and that with respect to cluster 3 as feature degree 3. According to the feature degrees stored in the feature degree calculation result storage unit 26 shown in FIG. 17, as shown in FIG. 18, the terms “消しゴム” (eraser), “専用インク” (dedicated ink), “文字” (character), and “字消し” (jikeshi) are characteristic of cluster 2, and the terms “取り消し” (torikeshi, cancellation), “入力” (input), and “取りやめ” (toriyame, calling off) are characteristic of cluster 3.

次に、図１９のフローチャートを参照して、前述した類義語集合抽出処理（上記した図４に示すステップＳ７の処理）の処理手順について説明する。 Next, a processing procedure of the above-described synonym set extraction process (the process of step S7 shown in FIG. 4 described above) will be described with reference to the flowchart of FIG.

First, the synonym set extraction unit 37 reads the similarity calculation results from the similarity calculation result storage unit 25 (step S31). Specifically, it reads each pair of term A and term B stored in the similarity calculation result storage unit 25, together with the context similarity and the character string similarity associated with that pair. The similarity calculation results read from the similarity calculation result storage unit 25 (the pairs of term A and term B, the context similarities, and the character string similarities) are stored in the synonym set storage unit 27.

次に、類義語集合抽出部３７は、クラスタ生成部３５によって生成されたクラスタの各々について以下のステップＳ３２およびＳ３３を実行する。この処理の対象となるクラスタを対象クラスタと称する。 Next, the synonym set extraction unit 37 executes the following steps S32 and S33 for each of the clusters generated by the cluster generation unit 35. A cluster to be processed is referred to as a target cluster.

類義語集合抽出部３７は、特徴度算出結果格納部２６から特徴度算出結果を読み込む（ステップＳ３２）。この場合、類義語集合抽出部３７は、特徴度算出結果として、類義語集合格納部２７に格納された用語Ａの対象クラスタに対する特徴度および用語Ｂの対象クラスタに対する特徴度を読み込む。特徴度算出結果格納部２６から読み込まれた特徴度算出結果（用語Ａの対象クラスタに対する特徴度および用語Ｂの対象クラスタに対する特徴度）は、類義語集合格納部２７に格納される。 The synonym set extraction unit 37 reads the feature value calculation result from the feature value calculation result storage unit 26 (step S32). In this case, the synonym set extraction unit 37 reads the feature degree for the target cluster of the term A and the feature degree for the target cluster of the term B stored in the synonym set storage unit 27 as the feature degree calculation result. The feature value calculation results read from the feature value calculation result storage unit 26 (feature values for the target cluster of the term A and feature values for the target cluster of the term B) are stored in the synonym set storage unit 27.

For each pair of term A and term B stored in the synonym set storage unit 27, the synonym set extraction unit 37 determines whether the term A and term B are synonyms (a synonym set), based on four values stored in the synonym set storage unit 27: the context similarity of term A and term B, the character string similarity of term A and term B, the feature degree of term A with respect to the target cluster, and the feature degree of term B with respect to the target cluster. For example, if the product of these four values is greater than 0, term A and term B are determined to be synonyms. In other words, if none of the four values is 0, term A and term B are determined to be synonyms.

このような判定処理が類義語集合格納部２７に格納された用語Ａおよび用語Ｂの全ての組に対して実行されることによって、類義語集合抽出部３７は、用語Ａおよび用語Ｂを類義語として抽出する（ステップＳ３３）。 By executing such determination processing for all pairs of the term A and the term B stored in the synonym set storage unit 27, the synonym set extraction unit 37 extracts the term A and the term B as synonyms. (Step S33).

次に、類義語集合抽出部３７は、クラスタ生成部３５によって生成された全てのクラスタについてステップＳ３２およびＳ３３の処理が実行されたか否かを判定する（ステップＳ３４）。 Next, the synonym set extraction unit 37 determines whether or not the processing of steps S32 and S33 has been executed for all the clusters generated by the cluster generation unit 35 (step S34).

全てのクラスタについて処理が実行されていないと判定された場合（ステップＳ３４のＮＯ）、上記したステップＳ３２に戻って処理が繰り返される。この場合、ステップＳ３２およびＳ３３の処理が実行されていないクラスタを対象クラスタとして処理が実行される。 When it is determined that the processing has not been executed for all the clusters (NO in step S34), the process returns to the above-described step S32 and the processing is repeated. In this case, the process is executed with the cluster for which the processes of steps S32 and S33 have not been executed as the target cluster.

一方、全てのクラスタについて処理が実行されたと判定された場合（ステップＳ３４のＹＥＳ）、類義語集合抽出処理は終了される。 On the other hand, if it is determined that the processing has been executed for all the clusters (YES in step S34), the synonym set extraction processing is terminated.
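The loop over clusters (steps S32 to S34) and the product test can be sketched as follows. The function name `extract_synonyms` and the data layout are assumptions. The numbers are the ones this description quotes for clusters 1 and 2, and cluster 3 is left out because its individual values are not given in the text.

```python
def extract_synonyms(pairs, feature_degrees, cluster_ids):
    """For each target cluster, keep the (term A, term B) pairs whose four
    values (context similarity, character string similarity, feature degree
    of A, feature degree of B) have a product greater than 0."""
    result = {}
    for cluster_id in cluster_ids:                       # S34: repeat per cluster
        synonyms = []
        for term_a, term_b, context_sim, string_sim in pairs:
            degree_a = feature_degrees[term_a][cluster_id]   # S32
            degree_b = feature_degrees[term_b][cluster_id]
            if context_sim * string_sim * degree_a * degree_b > 0:  # S33
                synonyms.append((term_a, term_b))
        result[cluster_id] = synonyms
    return result


# (term A, term B, context similarity, character string similarity)
pairs = [
    ("鉛筆", "消しゴム", 1.0, 0.0),
    ("字消し", "消しゴム", 0.75, 0.67),
]
# term -> {cluster ID: feature degree}
feature_degrees = {
    "鉛筆": {1: 0.0, 2: 0.0},
    "消しゴム": {1: 0.0, 2: 0.5},
    "字消し": {1: 0.0, 2: 0.75},
}
print(extract_synonyms(pairs, feature_degrees, [1, 2]))
```

Using a product rather than a sum means that a single zero (no shared context, no shared characters, or a term that is not characteristic of the target cluster) is enough to reject a pair.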

The synonym set extraction process described above is now explained concretely with reference to FIGS. 20 to 24. Here, it is assumed that the similarity calculation result storage unit 25 stores the information shown in FIG. 14, that the feature degree calculation result storage unit 26 stores the information shown in FIG. 17, and that clusters 1 to 3 have been generated by the cluster generation unit 35 as described with reference to FIG. 8.

First, the synonym set extraction unit 37 reads, as the similarity calculation results, all of the set IDs, pairs of term A and term B, context similarities, and character string similarities stored in association with one another in the similarity calculation result storage unit 25. The similarity calculation results read from the similarity calculation result storage unit 25 (the set IDs, the pairs of term A and term B, the context similarities, and the character string similarities) are stored in the synonym set storage unit 27. FIG. 20 shows an example of the data structure of the synonym set storage unit 27 after the similarity calculation results have been stored.

以下、クラスタ生成部３５によって生成されたクラスタ１〜３の各々について処理が実行されるが、ここではクラスタ２について具体的に説明する。 Hereinafter, the processing is executed for each of the clusters 1 to 3 generated by the cluster generation unit 35. Here, the cluster 2 will be described in detail.

In this case, the synonym set extraction unit 37 reads from the feature degree calculation result storage unit 26, as the feature degree calculation results, the feature degree with respect to cluster 2 of each term A and the feature degree with respect to cluster 2 of each term B stored in the synonym set storage unit 27.

特徴度算出結果格納部２６から読み込まれた特徴度算出結果（用語Ａのクラスタ２に対する特徴度および用語Ｂのクラスタ２に対する特徴度）は、類義語集合格納部２７に格納される。なお、ここで類義語集合格納部２７に格納された用語Ａのクラスタ２に対する特徴度および用語Ｂのクラスタ２に対する特徴度は、単に用語Ａの特徴度および用語Ｂの特徴度とする。図２１は、特徴度算出結果が格納された後の類義語集合格納部２７のデータ構造の一例を示す。 The feature value calculation results read from the feature value calculation result storage unit 26 (the feature values of the term A with respect to the cluster 2 and the feature values of the term B with respect to the cluster 2) are stored in the synonym set storage unit 27. Here, the characteristic degree of the term A with respect to the cluster 2 and the characteristic degree of the term B with respect to the cluster 2 stored in the synonym set storage unit 27 are simply the characteristic degree of the term A and the characteristic degree of the term B. FIG. 21 shows an example of the data structure of the synonym set storage unit 27 after the feature calculation result is stored.

次に、類義語集合抽出部３７は、類義語集合格納部２７に格納された用語Ａおよび用語Ｂの組毎に、当該類義語集合格納部２７に格納された当該用語Ａおよび用語Ｂの文脈類似度、当該用語Ａおよび用語Ｂの文字列類似度、当該用語Ａの特徴度および用語Ｂの特徴度に基づいて当該用語Ａおよび用語Ｂが類義語（の集合）であるか否かを判定する。上記したように用語Ａおよび用語Ｂの文脈類似度、当該用語Ａおよび用語Ｂの文字列類似度、当該用語Ａの特徴度および用語Ｂの特徴度の４つの値の積が０より大きい場合、当該用語Ａおよび用語Ｂは類義語であると判定される。 Next, the synonym set extraction unit 37 calculates the context similarity of the terms A and B stored in the synonym set storage unit 27 for each set of terms A and B stored in the synonym set storage unit 27. It is determined whether or not the terms A and B are synonyms (a set of) based on the character string similarity of the terms A and B, the features of the term A, and the features of the term B. When the product of the four values of the context similarity of terms A and B, the string similarity of terms A and B, the features of terms A and B and the features of term B as described above is greater than 0, The terms A and B are determined to be synonyms.

ここで、図２１に示す類義語集合格納部２７に格納された例えば用語Ａ「鉛筆」および用語Ｂ「消しゴム」の場合、用語Ａ「鉛筆」および用語Ｂ「消しゴム」の文脈類似度は１であり、用語Ａ「鉛筆」および用語Ｂ「消しゴム」の文字列類似度は０であり、用語Ａ「鉛筆」の特徴度は０であり、用語Ｂ「消しゴム」の特徴度は０．５である。この場合、用語Ａ「鉛筆」および用語Ｂ「消しゴム」の文脈類似度、当該用語Ａ「鉛筆」および用語Ｂ「消しゴム」の文字列類似度、当該用語Ａ「鉛筆」の特徴度、および当該用語Ｂの特徴度の４つの値の積は０であるため、用語Ａ「鉛筆」および用語Ｂ「消しゴム」は類義語でないと判定される。 For example, in the case of the term A “pencil” and the term B “eraser” stored in the synonym set storage unit 27 illustrated in FIG. 21, the context similarity of the term A “pencil” and the term B “eraser” is 1. The term A “pencil” and the term B “eraser” have a character string similarity of 0, the term A “pencil” has a feature degree of 0, and the term B “eraser” has a feature degree of 0.5. In this case, the context similarity of the term A “pencil” and the term B “eraser”, the character string similarity of the term A “pencil” and the term B “eraser”, the feature of the term A “pencil”, and the term Since the product of the four values of the characteristic values of B is 0, it is determined that the term A “pencil” and the term B “eraser” are not synonyms.

On the other hand, for term A "字消し" (jikeshi, eraser) and term B "消しゴム" (keshigomu, eraser) stored in the synonym set storage unit 27 shown in FIG. 21, the context similarity is 0.75, the character string similarity is 0.67, the feature degree of term A "字消し" is 0.75, and the feature degree of term B "消しゴム" is 0.5. Because the product of these four values is not 0, term A "字消し" and term B "消しゴム" are determined to be synonyms.

In this way, the synonym set extraction unit 37 executes the determination process for every pair of term A and term B stored in the synonym set storage unit 27 shown in FIG. 21.

FIG. 22 shows the determination results produced by the synonym set extraction unit 37. In the example shown in FIG. 22, of the pairs of term A and term B stored in the synonym set storage unit 27, only term A "字消し" (jikeshi) and term B "消しゴム" (keshigomu) are determined to be synonyms. In this case, the synonym set extraction unit 37 extracts term A "字消し" and term B "消しゴム" as synonyms.
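The four-value determination described above can be sketched as follows. This is a minimal illustration, not the apparatus itself: the function name `is_synonym_pair` and the row layout are assumptions, and only the two pairs whose values are quoted in the text are included.

```python
from math import prod

# Each row: (term A, term B, context similarity, character string similarity,
#            feature degree of A, feature degree of B).
# Values are the ones quoted in the text for cluster 2 (FIG. 21).
rows = [
    ("鉛筆", "消しゴム", 1.0, 0.0, 0.0, 0.5),
    ("字消し", "消しゴム", 0.75, 0.67, 0.75, 0.5),
]


def is_synonym_pair(context_sim, string_sim, feature_a, feature_b):
    """Return True when the product of the four values is greater than 0."""
    return prod((context_sim, string_sim, feature_a, feature_b)) > 0


# Keep only the pairs that pass the determination.
synonyms = [(a, b) for a, b, *scores in rows if is_synonym_pair(*scores)]
print(synonyms)
```

Because any single zero makes the product zero, the test is equivalent to requiring all four values to be nonzero (they are non-negative scores in the examples given).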

Although cluster 2 has been described here, the same processing is executed for cluster 1 and cluster 3. In the case of cluster 1, for example, no synonyms are extracted because the feature degrees of term A and term B are all 0, as shown in FIG. 23. In the case of cluster 3, on the other hand, as shown in FIG. 24, the pair of term A "取り消し" (torikeshi, cancellation) and term B "取りやめ" (toriyame, calling off) and the pair of term A "取りやめ" and term B "取り消し" are determined to be synonyms, so term A "取り消し" and term B "取りやめ" (and term A "取りやめ" and term B "取り消し") are extracted as synonyms.

The two terms (term A and term B) extracted as synonyms by the synonym set extraction unit 37 as described above are output (displayed) by the output processing unit 38. FIG. 25 shows an example of the display screen on which two terms extracted as synonyms by the synonym set extraction unit 37 are displayed. In the example shown in FIG. 25, the display shows the terms "字消し" (jikeshi) and "消しゴム" (keshigomu), extracted as synonyms when cluster 2 was processed, and the terms "取り消し" (torikeshi) and "取りやめ" (toriyame), extracted as synonyms when cluster 3 was processed. By referring to a display screen such as that shown in FIG. 25, the user can specify whether to register the two displayed terms as synonyms.

As described above, in the present embodiment, two terms (first and second terms) are extracted from the terms included in the plurality of documents stored in the document database 22, the similarity of the two extracted terms is calculated, clusters to which the plurality of documents stored in the document database 22 belong are generated, the feature degree of each of the two terms for a generated cluster is calculated, and the two terms are extracted as synonyms based on their similarity and on their respective feature degrees for the cluster. With this configuration, only appropriate terms can be extracted from the documents as synonyms, without extracting incorrect terms as synonyms.

Specifically, if synonyms were extracted based only on, for example, the context similarity and character string similarity shown in FIG. 14, every pair of terms for which the product of these two values is not 0 would be extracted as synonyms: here, "取り消し" (torikeshi, cancellation) and "消しゴム" (keshigomu, eraser); "取り消し" and "取りやめ" (toriyame, calling off); and "字消し" (jikeshi, eraser) and "消しゴム". In that case, the inappropriate pair "取り消し" and "消しゴム" would also be extracted as synonyms. In the present embodiment, by contrast, the cluster to which the documents containing the terms belong (that is, the feature degree for the cluster) is considered in addition to the context similarity and character string similarity. This excludes the inappropriate pair "取り消し" and "消しゴム", so that only "取り消し" and "取りやめ", and "字消し" and "消しゴム", are extracted as synonyms, as described above.
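The contrast between the two approaches can be illustrated with a small sketch. The similarity values for the pairs involving "取り消し" and the nonzero feature degrees in cluster 3 are hypothetical placeholders (the text states only that they are nonzero); the zero and nonzero pattern of the feature degrees follows the description of clusters 1 to 3. The function names are assumptions made for this illustration.

```python
# (context similarity, character string similarity) per pair.
similarities = {
    ("取り消し", "消しゴム"): (0.5, 0.5),   # hypothetical nonzero values
    ("取り消し", "取りやめ"): (0.5, 0.5),   # hypothetical nonzero values
    ("字消し", "消しゴム"): (0.75, 0.67),   # values quoted in the text
}

# Feature degree of each term per cluster; a missing term means 0.
features = {
    1: {},                                    # all feature degrees are 0
    2: {"字消し": 0.75, "消しゴム": 0.5},      # values quoted in the text
    3: {"取り消し": 0.5, "取りやめ": 0.5},     # hypothetical nonzero values
}


def by_similarity_only(similarities):
    """Pairs whose two similarity values have a nonzero product."""
    return {pair for pair, (c, s) in similarities.items() if c * s > 0}


def with_cluster_features(similarities, features):
    """Pairs whose four-value product is positive in at least one cluster."""
    result = set()
    for cluster in features.values():
        for (a, b), (c, s) in similarities.items():
            if c * s * cluster.get(a, 0.0) * cluster.get(b, 0.0) > 0:
                result.add((a, b))
    return result


print(by_similarity_only(similarities))
print(with_cluster_features(similarities, features))
```

In this sketch the pair "取り消し" and "消しゴム" survives the similarity-only test but is dropped once feature degrees are included, because the two terms never both have a nonzero feature degree in the same cluster.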

In other words, according to the present embodiment, synonyms are extracted based on both local similarity (context similarity or character string similarity) and global similarity (the feature degree for a cluster). This suppresses the extraction of erroneous synonyms (noise terms) that would otherwise result from local similarity within the documents.

The synonyms extracted in the present embodiment can be registered in, for example, a synonym dictionary and used in document search, document classification, and the like.

Although the present embodiment has been described as calculating both the context similarity and the character string similarity as the similarity of two terms, only one of the two may be calculated. When only the context similarity is calculated, it suffices to execute, for example, steps S11 to S17 shown in FIG. 9. When only the character string similarity is calculated, it suffices to execute, for example, steps S18 and S19 shown in FIG. 9 for each combination of two terms among the plurality of terms stored in the term aggregation result storage unit 24.

(Second Embodiment)
Next, a second embodiment of the present invention will be described. The functional configuration of the document processing apparatus according to the present embodiment is the same as that of the first embodiment described above, and will therefore be described with reference to FIG. 2 as appropriate.

The present embodiment differs from the first embodiment described above in that the similarity calculation process by the similarity calculation unit 34 is executed after the feature degree calculation process by the feature degree calculation unit 36. That is, the similarity calculation process in the present embodiment uses the feature degree calculation results stored in the feature degree calculation result storage unit 26 rather than the appearance frequency information (term aggregation results) stored in the term aggregation result storage unit 24.

The processing procedure of the document processing apparatus 30 according to the present embodiment will now be described with reference to the flowchart of FIG. 26.

First, steps S41 to S45, which correspond to steps S1 to S3, S5, and S6 shown in FIG. 4, are executed.

When step S42 is executed, dependency relationship information (analysis results) is stored in the analysis result storage unit 23 as shown in FIGS. 5 and 6. When step S43 is executed, appearance frequency information (term aggregation results) is stored in the term aggregation result storage unit 24 as shown in FIG. 7. When step S45 is executed, feature degree calculation results such as those shown in FIG. 17 are stored in the feature degree calculation result storage unit 26.

Next, the similarity calculation unit 34 executes the similarity calculation process with reference to the analysis result storage unit 23 and the feature degree calculation result storage unit 26 (step S46). In this process, two terms (first and second terms) are extracted from among the terms whose feature degrees for the clusters generated by the cluster generation unit 35, as stored in the feature degree calculation result storage unit 26, satisfy a predetermined condition described later, and the similarity of the two terms is calculated. That is, the similarity calculation process is executed not for all the terms stored in the feature degree calculation result storage unit 26 but only for the terms that satisfy the predetermined condition. The details of the similarity calculation process are described later.

Next, based on the similarities stored in the similarity calculation result storage unit 25, the synonym set extraction unit 37 executes a process of extracting the two terms extracted in the similarity calculation process as synonyms (a synonym set), referred to as the synonym set extraction process (step S47). The details of the synonym set extraction process are described later.

After the synonym set extraction process is executed, step S48, which corresponds to step S8 shown in FIG. 4, is executed.

Next, the procedure of the similarity calculation process (step S46 shown in FIG. 26) will be described with reference to the flowchart of FIG. 27.

In the similarity calculation process, the following steps S51 to S61 are executed for each of the clusters generated by the cluster generation unit 35. The cluster being processed is referred to as the target cluster.

First, the similarity calculation unit 34 reads the feature degree calculation results from the feature degree calculation result storage unit 26 (step S51). Here, the similarity calculation unit 34 reads, as the feature degree calculation results, the terms stored in the feature degree calculation result storage unit 26 and the feature degree of each term for the target cluster.

Next, the similarity calculation unit 34 extracts, from the terms read from the feature degree calculation result storage unit 26, the terms that satisfy the predetermined condition (step S52). The predetermined condition includes, for example, that the feature degree for the target cluster is not 0. In this case, based on the terms read from the feature degree calculation result storage unit 26 and their feature degrees for the target cluster, the similarity calculation unit 34 extracts the terms whose feature degree for the target cluster is not 0.

As in step S11 shown in FIG. 9, the similarity calculation unit 34 reads all the dependency relationship information stored in the analysis result storage unit 23 (step S53).

Next, treating the term 2 and the relation included in each piece of dependency relationship information read from the analysis result storage unit 23 as one pair, the similarity calculation unit 34 counts the number of distinct terms 1 associated with that pair of term 2 and relation in the read dependency relationship information (that is, the number of distinct terms 1 that stand in the same dependency relation to the same term 2) (step S54). In step S54, only the terms extracted in step S52 (that is, the terms whose feature degree for the target cluster is not 0) are counted.

Thereafter, steps S55 to S61, which correspond to steps S13 to S19 shown in FIG. 9, are executed. The context similarity calculated in step S59 and the character string similarity calculated in step S61 are stored in the similarity calculation result storage unit 25, as in the first embodiment.

After step S61 is executed, it is determined whether steps S51 to S61 have been executed for all the clusters generated by the cluster generation unit 35 (step S62).

If it is determined that the processing has not been executed for all the clusters (NO in step S62), the process returns to step S51 and is repeated, with a cluster for which steps S51 to S61 have not yet been executed as the target cluster.

On the other hand, if it is determined that the processing has been executed for all the clusters (YES in step S62), the similarity calculation process ends.

When the similarity calculation process has been executed as described above, the similarity calculation result storage unit 25 holds the similarity calculation results (context similarity and character string similarity) for each cluster generated by the cluster generation unit 35.

The similarity calculation process will now be described concretely with reference to FIGS. 28 and 29. It is assumed here that the analysis result storage unit 23 stores the plurality of pieces of dependency relationship information shown in FIGS. 5 and 6, that clusters 1 to 3 have been generated by the cluster generation unit 35 as described with reference to FIG. 8, and that the feature degree calculation result storage unit 26 stores the information shown in FIG. 17.

As described above, the similarity calculation process is executed for each of clusters 1 to 3 generated by the cluster generation unit 35. Cluster 2 is used here as the concrete example.

First, for each term stored in the feature degree calculation result storage unit 26, the similarity calculation unit 34 reads the term and its feature degree for cluster 2 from that unit. In the example shown in FIG. 17, the similarity calculation unit 34 reads the term "鉛筆" (enpitsu, pencil) and its feature degree "0" for cluster 2. The same applies to the other terms.

Next, the similarity calculation unit 34 extracts, from the terms read from the feature degree calculation result storage unit 26, the terms whose feature degree for cluster 2 is not 0. Concretely, among the terms stored in the feature degree calculation result storage unit 26 shown in FIG. 17 (that is, the terms read), every term other than "消しゴム" (keshigomu, eraser), "専用インク" (sen'yō inku, dedicated ink), "文字" (moji, character), and "字消し" (jikeshi, eraser) has a feature degree of 0 for cluster 2. The similarity calculation unit 34 therefore extracts "消しゴム", "専用インク", "文字", and "字消し" as the terms whose feature degree for cluster 2 is not 0.
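The filtering in step S52 can be sketched as follows. The feature degrees for "消しゴム", "字消し", and the zero-valued terms follow the text; the values for "専用インク" and "文字" are hypothetical placeholders, since the text states only that they are not 0. The function name `select_terms` and its `condition` parameter are assumptions made for this illustration.

```python
# Feature degree of each term for cluster 2.
features_cluster2 = {
    "鉛筆": 0.0,
    "消しゴム": 0.5,
    "専用インク": 0.3,  # hypothetical nonzero value
    "文字": 0.3,        # hypothetical nonzero value
    "字消し": 0.75,
    "時計": 0.0,
}


def select_terms(features, condition=lambda degree: degree != 0):
    """Return the terms whose feature degree satisfies the condition."""
    return [term for term, degree in features.items() if condition(degree)]


target_terms = select_terms(features_cluster2)
print(target_terms)
```

Passing the condition as a parameter reflects the wording that a nonzero feature degree is only one example of the predetermined condition; a threshold such as `lambda degree: degree >= 0.5` would fit the same sketch.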

The similarity calculation unit 34 reads all the dependency relationship information stored in the analysis result storage unit 23.

For each pair of term 2 and relation included in the read dependency relationship information, the similarity calculation unit 34 counts the number of distinct terms 1 associated with that pair. That is, based on the read dependency relationship information, the similarity calculation unit 34 counts the number of distinct terms 1 that stand in the same dependency relation to the same term 2.

In doing so, the similarity calculation unit 34 considers only the extracted terms "消しゴム" (keshigomu), "専用インク" (sen'yō inku), "文字" (moji), and "字消し" (jikeshi) as candidates for term 1 when counting the number of distinct terms 1 that stand in the same dependency relation to the same term 2.

To explain concretely using FIGS. 5 and 6, the terms 1 associated with the pair of term 2 "消す" (kesu, to erase) and relation "で" (de) are "字消し" (jikeshi) and "消しゴム" (keshigomu). Accordingly, in the dependency relationship information stored in the analysis result storage unit 23, the number of distinct terms 1 associated with the pair of term 2 "消す" and relation "で" (that is, terms 1 that stand in the "で" dependency relation to term 2 "消す") is 2.

In the example shown in FIGS. 5 and 6, the terms 1 associated with the pair of term 2 "持参" (jisan, to bring) and relation "を" (wo) are "鉛筆" (enpitsu, pencil), "消しゴム" (keshigomu, eraser), and "時計" (tokei, watch). However, because only the terms "消しゴム", "専用インク", "文字", and "字消し" are considered, as described above, the number of distinct terms 1 associated with the pair of term 2 "持参" and relation "を" is 1.

Next, for each pair of term 2 and relation whose counted number of distinct terms 1 is 2 or more, the similarity calculation unit 34 counts the appearance frequency of each term 1 associated with that pair.

For example, because the number of distinct terms 1 associated with the pair of term 2 "消す" (kesu, to erase) and relation "で" (de) is 2 or more, the similarity calculation unit 34 counts the appearance frequency of each of the terms 1 "字消し" (jikeshi) and "消しゴム" (keshigomu) associated with that pair. In the example shown in FIGS. 5 and 6, the appearance frequency of term 1 "字消し" associated with the pair of term 2 "消す" and relation "で" is 2, and that of term 1 "消しゴム" is 1. In the example shown in FIGS. 5 and 6, no pair of term 2 and relation other than the pair of "消す" and "で" has 2 or more distinct terms 1.
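The counting in this walkthrough can be sketched as follows. The triples below are only the dependency relations mentioned in the text, not the full contents of FIGS. 5 and 6, and the function name `count_term1` and the triple layout `(term 1, relation, term 2)` are assumptions made for this illustration.

```python
from collections import Counter, defaultdict

# (term 1, relation, term 2) triples reconstructed from the text.
triples = [
    ("字消し", "で", "消す"),
    ("字消し", "で", "消す"),
    ("消しゴム", "で", "消す"),
    ("鉛筆", "を", "持参"),
    ("消しゴム", "を", "持参"),
    ("時計", "を", "持参"),
]

# Terms whose feature degree for cluster 2 is not 0.
allowed = {"消しゴム", "専用インク", "文字", "字消し"}


def count_term1(triples, allowed):
    """Count appearances of each allowed term 1 per (term 2, relation) pair."""
    freq = defaultdict(Counter)
    for term1, relation, term2 in triples:
        if term1 in allowed:
            freq[(term2, relation)][term1] += 1
    return freq


freq = count_term1(triples, allowed)

# Number of distinct terms 1 per (term 2, relation) pair.
distinct = {key: len(counter) for key, counter in freq.items()}

# Appearance frequencies, kept only where 2 or more distinct terms 1 exist.
shared = {key: dict(counter) for key, counter in freq.items() if len(counter) >= 2}

print(distinct)
print(shared)
```

With the filter applied, the pair of "持参" and "を" drops from three distinct terms 1 to one, so only the pair of "消す" and "で" proceeds to the frequency count.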

Next, the similarity calculation unit 34 generates intermediate processing result information; in this case, the intermediate processing result information shown in FIG. 28. The generation of intermediate processing result information is as described in the first embodiment, so a detailed description is omitted.

Next, based on the intermediate processing result information shown in FIG. 28, the similarity calculation unit 34 extracts all combinations (permutations) of two terms 1. In this case, the similarity calculation unit 34 extracts two combinations: "字消し" (jikeshi) and "消しゴム" (keshigomu), and "消しゴム" and "字消し".

Each combination of terms 1 extracted by the similarity calculation unit 34 in this way is stored in the similarity calculation result storage unit 25 as a pair of term A and term B.
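The ordered pairing described above corresponds to taking length-2 permutations of the candidate terms. A minimal sketch using the Python standard library, with the record layout (`pair_id`, `term_a`, `term_b`) assumed for illustration:

```python
from itertools import permutations

# Terms 1 that share the pair of term 2 "消す" and relation "で".
terms = ["字消し", "消しゴム"]

# Ordered (term A, term B) pairs; (x, y) and (y, x) are both kept.
pairs = list(permutations(terms, 2))

# Hypothetical record layout for storing the pairs with a pair ID.
records = [
    {"pair_id": i, "term_a": a, "term_b": b}
    for i, (a, b) in enumerate(pairs, start=1)
]
print(records)
```

For n candidate terms this yields n(n-1) ordered pairs, which is why restricting the candidates by feature degree beforehand reduces the amount of similarity calculation.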

Although a detailed description is omitted, the context similarity and character string similarity are then calculated for each pair of term A and term B stored in the similarity calculation result storage unit 25, as in the first embodiment. The calculated context similarity and character string similarity are stored in the similarity calculation result storage unit 25.

Although the processing has been described here for cluster 2, the same applies to clusters 1 and 3, so a detailed description is omitted.

FIG. 29 shows an example of the data structure of the similarity calculation result storage unit 25 after the processing has been executed for each of clusters 1 to 3.

For example, the information for pair IDs "1" and "2" shown in FIG. 29 is the information stored in the similarity calculation result storage unit 25 when cluster 2 is processed as described above (the similarity calculation results for cluster 2). The information for pair IDs "3" and "4" is, although a detailed description is omitted, the information stored in the similarity calculation result storage unit 25 when cluster 3 is processed (the similarity calculation results for cluster 3).

Next, the procedure of the synonym set extraction process (step S47 shown in FIG. 26) will be described with reference to the flowchart of FIG. 30.

First, the synonym set extraction unit 37 reads the similarity calculation results from the similarity calculation result storage unit 25 (step S71). Here, the synonym set extraction unit 37 reads, as the similarity calculation results, each pair of term A and term B stored in the similarity calculation result storage unit 25 together with the context similarity and the character string similarity associated with that pair. The similarity calculation results read from the similarity calculation result storage unit 25 are stored in the synonym set storage unit 27.

Next, for each pair of term A and term B stored in the synonym set storage unit 27, the synonym set extraction unit 37 determines whether term A and term B are synonyms (a synonym set) based on the context similarity and the character string similarity of term A and term B stored in that unit. For example, term A and term B are determined to be synonyms when the product of their context similarity and their character string similarity is greater than 0 (that is, when neither the context similarity nor the character string similarity is 0).

By executing this determination for every pair of term A and term B stored in the synonym set storage unit 27, the synonym set extraction unit 37 extracts the pairs of term A and term B determined to be synonyms (step S72).

For example, if the similarity calculation result storage unit 25 stores the information shown in FIG. 29, the synonym set extraction process extracts term A "字消し" (jikeshi, eraser) and term B "消しゴム" (keshigomu, eraser) as synonyms. Term A "取り消し" (torikeshi, cancellation) and term B "取りやめ" (toriyame, calling off) (and term A "取りやめ" and term B "取り消し") are also extracted as synonyms.

The present embodiment therefore extracts the same synonyms as the first embodiment described above.

As described above, in the present embodiment, clusters to which the plurality of documents stored in the document database 22 belong are generated, the feature degree for a cluster is calculated for each term included in the plurality of documents stored in the document database 22, two terms (first and second terms) are extracted from among the terms whose feature degree satisfies a predetermined condition, the similarity of the two extracted terms is calculated, and the two terms are extracted as synonyms based on the calculated similarity. With this configuration, as in the first embodiment, only appropriate terms can be extracted from the documents as synonyms, without extracting incorrect terms as synonyms. In addition, because the terms subject to the similarity calculation process can be narrowed down, the processing load (amount of computation) of the similarity calculation process can be reduced.

According to at least one of the embodiments described above, it is possible to provide a document processing apparatus and program capable of extracting terms that are appropriate as synonyms from documents.

The present invention is not limited to the embodiments exactly as described above; at the implementation stage, the constituent elements may be modified without departing from the gist of the invention. Various inventions can also be formed by appropriately combining the plurality of constituent elements disclosed in the embodiments. For example, some constituent elements may be deleted from all the constituent elements shown in an embodiment. Furthermore, constituent elements from different embodiments may be combined as appropriate.

DESCRIPTION OF SYMBOLS 10 ... computer, 20 ... external storage device, 22 ... document database, 23 ... analysis result storage unit, 24 ... term aggregation result storage unit, 25 ... similarity calculation result storage unit, 26 ... feature degree calculation result storage unit, 27 ... synonym set storage unit, 30 ... document processing apparatus, 31 ... input processing unit, 32 ... analysis unit, 33 ... term aggregation unit, 34 ... similarity calculation unit, 35 ... cluster generation unit, 36 ... feature degree calculation unit, 37 ... synonym set extraction unit, 38 ... output processing unit, 341 ... context similarity calculation unit, 342 ... character string similarity calculation unit.

Claims

A document processing apparatus comprising:
document storage means for storing a plurality of documents including terms each consisting of one or more words;
term extraction means for extracting first and second terms from the terms contained in the plurality of documents stored in the document storage means;
similarity calculation means for calculating the similarity between the extracted first and second terms;
cluster generation means for generating a cluster to which each of the plurality of documents stored in the document storage means belongs;
feature degree calculation means for calculating a feature degree of the first term with respect to the cluster based on the appearance frequency of the extracted first term in the plurality of documents stored in the document storage means and in the documents belonging to the generated cluster, and for calculating a feature degree of the second term with respect to the cluster based on the appearance frequency of the extracted second term in the plurality of documents stored in the document storage means and in the documents belonging to the generated cluster; and
synonym extraction means for extracting the first and second terms as synonyms based on the similarity calculated by the similarity calculation means and on the feature degree of the first term and the feature degree of the second term calculated by the feature degree calculation means.

The document processing apparatus according to claim 1, wherein
the term extraction means extracts, from among the terms contained in the plurality of documents stored in the document storage means, first and second terms that are in the same dependency relationship with the same term, and
the similarity calculation means calculates the similarity between the extracted first and second terms based on the appearance frequency, in the plurality of documents stored in the document storage means, of the first term in the same dependency relationship with the same term, and on the appearance frequency of the first term in the plurality of documents stored in the document storage means.

The document processing apparatus according to claim 1, wherein the similarity calculation means calculates the similarity between the first and second terms based on the number of characters common to the extracted first and second terms.

A document processing apparatus comprising:
document storage means for storing a plurality of documents including terms each consisting of one or more words;
cluster generation means for generating a cluster to which the plurality of documents stored in the document storage means belong;
feature degree calculation means for calculating, for each term contained in the plurality of documents stored in the document storage means, a feature degree of the term with respect to the cluster based on the appearance frequency of the term in the plurality of documents stored in the document storage means and in the documents belonging to the generated cluster;
term extraction means for extracting first and second terms from among the terms whose feature degree, calculated by the feature degree calculation means, satisfies a predetermined condition;
similarity calculation means for calculating the similarity between the extracted first and second terms; and
synonym extraction means for extracting the extracted first and second terms as synonyms based on the similarity calculated by the similarity calculation means.

A program executed by a computer in a document processing apparatus comprising an external storage device, which has document storage means for storing a plurality of documents including terms each consisting of one or more words, and the computer, which uses the external storage device,
the program causing the computer to execute:
extracting first and second terms from the terms contained in the plurality of documents stored in the document storage means;
calculating the similarity between the extracted first and second terms;
generating a cluster to which each of the plurality of documents stored in the document storage means belongs;
calculating a feature degree of the first term with respect to the cluster based on the appearance frequency of the extracted first term in the plurality of documents stored in the document storage means and in the documents belonging to the generated cluster, and calculating a feature degree of the second term with respect to the cluster based on the appearance frequency of the extracted second term in the plurality of documents stored in the document storage means and in the documents belonging to the generated cluster; and
extracting the first and second terms as synonyms based on the calculated similarity, the calculated feature degree of the first term, and the calculated feature degree of the second term.