JP6764973B1

JP6764973B1 - Related word dictionary creation system, related word dictionary creation method and related word dictionary creation program

Info

Publication number: JP6764973B1
Application number: JP2019083861A
Authority: JP
Inventors: 貴之山泉; 敦史大熊; 秀正前川
Original assignee: Mizuho Information and Research Institute Inc
Current assignee: Mizuho Information and Research Institute Inc
Priority date: 2019-04-25
Filing date: 2019-04-25
Publication date: 2020-10-07
Anticipated expiration: 2039-04-25
Also published as: JP2020181367A

Abstract

PROBLEM TO BE SOLVED: To provide a related word dictionary creation system, a related word dictionary creation method, and a related word dictionary creation program for efficiently creating a related word dictionary. SOLUTION: A management system 20 includes a control unit 21 that identifies related word sets in a document set composed of a plurality of documents, and a dictionary storage unit 24 that records the related word sets. The control unit 21 generates, from the plurality of documents constituting the document set, sub-document groups each composed of the number of sampled documents; in each sub-document group, it generates word sets using the words included in that sub-document group and calculates an appearance rate according to the number of documents in which each word set appears; and, in each sub-document group, it records each word set identified according to the appearance rate in the dictionary storage unit 24 as a related word set. [Selected drawing] FIG. 1

Description

The present invention relates to a related word dictionary creation system, a related word dictionary creation method, and a related word dictionary creation program for creating a dictionary in which a plurality of mutually related words are registered.

In text search on the Internet or in databases, a search is performed using keywords entered by the user. If the search results are unsatisfactory, the search may be repeated with other keywords, and a related word dictionary that stores related words may be used for this purpose. It is also possible, at search time, to present related words that are likely to interest the user for the entered keyword, and thereby offer the opportunity to search with those related words. Techniques for creating dictionaries of such related words have also been studied (for example, Patent Documents 1 and 2).

The technique described in Patent Document 1 refers to concept dictionary data containing a plurality of relation item data entries, each indicating two words and a relation name that expresses the relationship between them, and extracts, from the relation item data containing the keyword, pairs consisting of a relation name and a word other than the keyword. Further, taking the words in the concept dictionary data as related word candidates, it extracts, for each candidate, pairs consisting of a relation name and a word other than the candidate from the relation item data containing that candidate. If any pair of relation name and word extracted for a candidate matches one of the pairs extracted for the keyword, that candidate is output as a related word.

In the technique described in Patent Document 2, images tagged with metadata are input in pairs, and for each combination of the metadata attached to the two images of a pair input to the server, "1" is added to the co-occurrence frequency in a metadata co-occurrence frequency table. A relevance score between words is then calculated with a formula based on the co-occurrence frequency, with reference to the metadata co-occurrence frequency table, and the related word dictionary table is updated with the relevance score calculated by the score calculation unit.

Patent Document 1: JP-A-2015-130111
Patent Document 2: JP-A-2009-266065

As in the prior art above, related words can be identified by using co-occurrence analysis to generate combinations of words. However, when co-occurrence analysis is performed on the words contained in a large number of documents in order to increase the number of related words, the system load grows as a power of the number of words in the documents. Moreover, most word combinations appear only infrequently. The appearance frequency could instead be calculated for each part of a document, but then the evaluation may not reflect the documents as a whole. Furthermore, calculating the appearance frequency only after all word combinations have been generated imposes a heavy system load.

A related word dictionary creation system that solves the above problems includes a control unit that identifies related word sets in a document set composed of a plurality of documents, and a dictionary storage unit that records the related word sets. The control unit extracts documents from the document set, in the number of sampled documents, to generate a plurality of sub-document groups; in each sub-document group, it generates word sets using the words included in that sub-document group and calculates an appearance rate according to the number of documents in which each word set appears; and, in each sub-document group, it records each word set identified according to the appearance rate in the dictionary storage unit as a related word set.

According to the present invention, a related word dictionary can be created efficiently.

FIG. 1 is a schematic diagram of the system of the present embodiment.
FIG. 2 is an explanatory diagram of the hardware configuration of the present embodiment.
FIG. 3 is an explanatory diagram of a processing procedure of the present embodiment.
FIG. 4 is an explanatory diagram of a processing procedure of the present embodiment.
FIG. 5 is an explanatory diagram of processing procedures of the present embodiment, in which (a) shows a procedure for identifying related word sets using the entire document set and (b) shows a procedure for identifying related word sets after dividing the document set into sub-document groups.

Hereinafter, an embodiment of a related word dictionary creation system, a related word dictionary creation method, and a related word dictionary creation program will be described with reference to FIGS. 1 to 5.
First, the concept of the related word dictionary creation method will be described with reference to FIG. 5.
As shown in FIG. 5(a), the aim is to generate, by co-occurrence analysis, word sets WS11 that combine the words contained in a huge document set D1, and to use these word sets WS11 to identify related word sets WS12. In this co-occurrence analysis, the related word sets WS12 are formed from those combinations in the word sets WS11 whose words frequently appear together in a given sentence or passage. When the document set D1 is large, a huge number of word sets WS11 are generated, so the data analysis load becomes heavy.

Therefore, as shown in FIG. 5(b), the document set D1 is divided into sub-document groups SD1, and co-occurrence analysis is performed for each sub-document group SD1 to generate word sets WS21 in each sub-document group SD1. Then, among the word sets WS21 of each sub-document group SD1, related word sets WS22 with a high appearance rate (appearance index value) are identified. The related word dictionary is created so that the related word sets WS12 generated from the word sets WS11 and the related word sets WS22 generated from the word sets WS21 substantially match.
As shown in FIG. 1, a management system 20 is used to create this related word dictionary.

(Hardware configuration)
The hardware configuration of the information processing device H10 constituting the management system 20 will be described with reference to FIG. 2. The information processing device H10 includes a communication device H11, an input device H12, a display device H13, a storage unit H14, and a processor H15. This hardware configuration is an example, and the system can also be realized with other hardware.

The communication device H11 is an interface that establishes a communication path with other devices and transmits and receives data, for example a network interface card or a wireless interface.

The input device H12 is a device that receives input from a user or the like, for example a mouse or a keyboard. The display device H13 is a display or the like that shows various information.
The storage unit H14 is a storage device that stores data and programs for executing the various functions of the management system 20. Examples of the storage unit H14 include ROM, RAM, and a hard disk.

The processor H15 controls each process in the management system 20 using the programs and data stored in the storage unit H14. Examples of the processor H15 include a CPU and an MPU. The processor H15 loads a program stored in ROM or the like into RAM and executes the various processes for each service.

The processor H15 is not limited to one that performs all of its processing in software. For example, the processor H15 may include a dedicated hardware circuit (for example, an application-specific integrated circuit: ASIC) that performs at least part of its processing in hardware. That is, the processor H15 can be configured as circuitry including (1) one or more processors that operate according to a computer program (software), (2) one or more dedicated hardware circuits that execute at least part of the various processes, or (3) a combination thereof. The processor includes a CPU and memories such as RAM and ROM, and the memory stores program code or instructions configured to cause the CPU to execute processing. The memory, that is, the computer-readable medium, includes any available medium that can be accessed by a general-purpose or dedicated computer.

(System configuration)
Next, the system configuration of the management system 20 will be described with reference to FIG. 1.
The management system 20 is a computer for creating a related word dictionary. The management system 20 includes a control unit 21, a document set storage unit 22, a sub-document storage unit 23, a dictionary storage unit 24, and a work memory 25.

The control unit 21 includes control means (CPU, RAM, ROM, etc.) and performs the processes described later (the preprocessing stage, index generation stage, co-occurrence analysis stage, data analysis stage, output processing stage, and so on). By executing the related word dictionary creation program for this purpose, the control unit 21 functions as a preprocessing unit 211, an index generation unit 212, a co-occurrence analysis unit 213, a data analysis unit 214, and an output processing unit 215.

The preprocessing unit 211 executes a process of deleting unnecessary information contained in the documents. For this purpose, the preprocessing unit 211 includes a preprocessing filter for deleting unnecessary information. The preprocessing unit 211 also stores the initially set number of word sets m0, the tolerance α of the word set appearance rate, and the number of sampled documents N. Here, the number of word sets m0 is the number of related word sets to be registered in the dictionary. The tolerance α of the word set appearance rate is the permissible deviation between the lower limit of the appearance rate that a word set calculated by processing the sampled documents must reach to be extracted as a related word set, and the appearance rate, within a sub-document, of a word set generated using that sub-document. The number of sampled documents N is the number of documents included in a sub-document.

The index generation unit 212 executes a process of acquiring the words contained in the documents, by part of speech, through morphological analysis. The index generation unit 212 further includes a part-of-speech filter for extracting only a specific part of speech (here, nouns).
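The part-of-speech filter can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes the morphological analyzer has already produced `(surface, part_of_speech)` pairs, and the function name `extract_nouns` and the tag string `"noun"` are hypothetical.

```python
from typing import Iterable, List, Tuple


def extract_nouns(
    tagged_tokens: Iterable[Tuple[str, str]],
    target_pos: str = "noun",  # assumed tag name; real analyzers use their own tag sets
) -> List[str]:
    """Keep only the surface forms whose part-of-speech tag equals target_pos."""
    return [surface for surface, pos in tagged_tokens if pos == target_pos]


# Hypothetical analyzer output for one sentence.
tokens = [("bank", "noun"), ("lends", "verb"), ("money", "noun")]
nouns = extract_nouns(tokens)
```

Keeping the filter separate from the analyzer means a different part of speech could be targeted by changing only `target_pos`.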

The co-occurrence analysis unit 213 executes a process of generating, by co-occurrence analysis, the word sets contained in sentences or passages.
The data analysis unit 214 executes a process of identifying related word sets according to the appearance rates of the word sets generated by the co-occurrence analysis and registering them in the dictionary storage unit 24.
The output processing unit 215 executes a process of displaying a graph generated from the related word sets registered in the dictionary storage unit 24. In the present embodiment, a graph is generated in which each related word is a node linked to other related words.

The document set storage unit 22 records the document set used to create the dictionary. The document set is recorded before the related word dictionary is created. It consists of documents (articles), each composed of a plurality of words. For example, when an encyclopedia is used as the document set, the set is composed of the documents (articles) corresponding to the headwords.

The sub-document storage unit 23 records the sub-documents used to generate related word sets. A sub-document is recorded when the documents to be used for generating related word sets are extracted from the document set. Each sub-document is composed of documents in the number of sampled documents.

The dictionary storage unit 24 records the generated related word set records. A related word set record is registered when the dictionary creation process is executed. Each record holds a word set that has a high appearance rate in the document set.

The work memory 25 is used to temporarily store various data while the co-occurrence analysis unit 213 and the data analysis unit 214 perform their processing.

(Dictionary creation process)
Next, the dictionary creation process will be described with reference to FIG. 3.
First, the control unit 21 of the management system 20 executes a process of acquiring the document set for dictionary creation (step S1-1). Specifically, the preprocessing unit 211 of the control unit 21 acquires the document set from which the related word dictionary is to be created and records it in the document set storage unit 22. For example, it acquires a plurality of articles (a document set) recorded in an encyclopedia database and records them in the document set storage unit 22.

Next, the control unit 21 of the management system 20 executes a process of deleting unnecessary portions (step S1-2). Specifically, the preprocessing unit 211 of the control unit 21 uses the preprocessing filter to delete unnecessary information contained in the document set. For example, it deletes tags and the like embedded in the document set.
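As one way to picture the preprocessing filter of step S1-2, the sketch below strips markup tags with a regular expression. The text only says that "tags and the like" are deleted, so the assumption that they are HTML/XML-style angle-bracket tags, and the function name `strip_tags`, are illustrative.

```python
import re

# Assumed tag shape: anything between "<" and ">" (HTML/XML-like markup).
_TAG_PATTERN = re.compile(r"<[^>]+>")


def strip_tags(text: str) -> str:
    """Remove angle-bracket tags and leave the remaining text untouched."""
    return _TAG_PATTERN.sub("", text)


cleaned = strip_tags("<p>Tokyo is the <b>capital</b> of Japan.</p>")
```

A regular expression is enough for simple, well-formed tags; documents with nested or malformed markup would probably call for a real parser instead.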

Next, the control unit 21 of the management system 20 executes a process of extracting sub-document groups (step S1-3). Specifically, the preprocessing unit 211 of the control unit 21 divides the documents constituting the document set into groups of a predetermined size. Here, the preprocessing unit 211 determines a number that is statistically significant for the related word dictionary creation method (the number of sampled documents N). It then randomly extracts that number of documents from the document set to generate each sub-document group, and records the groups in the sub-document storage unit 23 (first sampling process). The first sampling process will be described later with reference to FIG. 4.

Next, the control unit 21 of the management system 20 repeats the following processing for each sub-document group recorded in the sub-document storage unit 23.
Here, the control unit 21 of the management system 20 executes a process of extracting nouns by morphological analysis (step S1-4). Specifically, the index generation unit 212 of the control unit 21 identifies the nouns among the words contained in the sub-document group.

Next, the control unit 21 of the management system 20 executes a process of creating word sets (step S1-5). Specifically, the co-occurrence analysis unit 213 of the control unit 21 uses the extracted nouns to generate word sets by co-occurrence analysis and temporarily records them in the work memory 25. Here, word sets are generated from nouns that are contained in the same sentence.
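Step S1-5 can be sketched as below, under two assumptions that the text does not fix: a word set is taken to be an unordered pair of distinct nouns, and each sentence arrives as a list of already extracted nouns. The function name `word_sets_in_sentence` is hypothetical.

```python
from itertools import combinations
from typing import Iterable, List, Tuple


def word_sets_in_sentence(nouns: Iterable[str]) -> List[Tuple[str, str]]:
    """Return every unordered pair of distinct nouns found in one sentence.

    Nouns are de-duplicated and sorted first so that the same pair always
    gets the same representation, e.g. ("bank", "loan") and never ("loan", "bank").
    """
    unique_nouns = sorted(set(nouns))
    return list(combinations(unique_nouns, 2))


pairs = word_sets_in_sentence(["bank", "loan", "bank", "rate"])
# pairs == [("bank", "loan"), ("bank", "rate"), ("loan", "rate")]
```

Normalizing the order of each pair matters for the later counting steps, since otherwise the same co-occurrence would be counted under two different keys.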

Next, the control unit 21 of the management system 20 executes a process of calculating the appearance rates of the word sets (step S1-6). Specifically, the co-occurrence analysis unit 213 of the control unit 21 counts the number of occurrences of each word set temporarily recorded in the work memory 25. It then calculates the appearance rate of each word set by dividing that word set's number of occurrences by the total number of occurrences of all word sets.
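The calculation in step S1-6 (occurrences of one word set divided by the occurrences of all word sets) can be illustrated as follows. This is a sketch: the function name `appearance_rates` and the representation of word sets as tuples are assumptions.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable


def appearance_rates(word_set_occurrences: Iterable[Hashable]) -> Dict[Hashable, float]:
    """Map each word set to (its occurrence count) / (total occurrences of all word sets)."""
    counts = Counter(word_set_occurrences)
    total = sum(counts.values())
    if total == 0:
        return {}  # no word sets were generated for this sub-document group
    return {word_set: count / total for word_set, count in counts.items()}


occurrences = [("a", "b"), ("a", "b"), ("a", "c"), ("b", "c")]
rates = appearance_rates(occurrences)
# rates[("a", "b")] == 0.5, rates[("a", "c")] == 0.25, rates[("b", "c")] == 0.25
```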

Next, the control unit 21 of the management system 20 executes a process of extracting word sets according to the appearance rate (step S1-7). Specifically, the data analysis unit 214 of the control unit 21 extracts word sets as related word sets in descending order of appearance rate (second sampling). The data analysis unit 214 then temporarily stores each extracted related word set in the work memory 25 in association with its appearance rate.
The above processing is repeated until it has been completed for all sub-document groups.

Next, the control unit 21 of the management system 20 executes a process of aggregating the related word sets (step S1-8). Specifically, when the same related word set is registered more than once among all the related word sets temporarily stored in the work memory 25, the data analysis unit 214 of the control unit 21 keeps only the entry associated with the highest appearance rate. Next, the data analysis unit 214 sorts the related word sets by appearance rate. It then identifies the m0 related word sets with the highest appearance rates and records them in the dictionary storage unit 24.
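The aggregation of step S1-8 amounts to a de-duplicate, sort, and truncate sequence, which the following sketch illustrates. The function name `aggregate_related_word_sets`, the dictionary-per-group input, and the alphabetical tie-break are assumptions made for the example.

```python
from typing import Dict, List, Sequence, Tuple

WordSet = Tuple[str, str]


def aggregate_related_word_sets(
    per_group_rates: Sequence[Dict[WordSet, float]],
    m0: int,
) -> List[Tuple[WordSet, float]]:
    """Merge per-group results, keep the highest rate per word set, return the top m0."""
    best: Dict[WordSet, float] = {}
    for rates in per_group_rates:
        for word_set, rate in rates.items():
            # When a word set appears in several groups, keep only its highest rate.
            if word_set not in best or rate > best[word_set]:
                best[word_set] = rate
    # Sort by descending rate; ties are broken alphabetically (an assumption).
    ranked = sorted(best.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:m0]


groups = [
    {("a", "b"): 0.5, ("a", "c"): 0.2},
    {("a", "b"): 0.4, ("b", "c"): 0.3},
]
top = aggregate_related_word_sets(groups, m0=2)
# top == [(("a", "b"), 0.5), (("b", "c"), 0.3)]
```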

Next, the control unit 21 of the management system 20 executes a graph creation process (step S1-9). Specifically, the output processing unit 215 of the control unit 21 generates a graph in which each related word constituting the related word sets recorded in the dictionary storage unit 24 is a node linked to its related words. When one word A is registered in a plurality of related word sets, links are generated from word A, as the center, to the other related words.
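The graph of step S1-9 can be represented as an undirected adjacency structure, as in the sketch below. The text does not specify a data structure or a drawing library, so the adjacency dictionary and the function name `build_related_word_graph` are assumptions, and rendering is left out.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def build_related_word_graph(
    related_word_sets: Iterable[Tuple[str, str]],
) -> Dict[str, Set[str]]:
    """Build an undirected graph: each related word is a node linked to its partners."""
    graph: Dict[str, Set[str]] = defaultdict(set)
    for word_a, word_b in related_word_sets:
        graph[word_a].add(word_b)
        graph[word_b].add(word_a)
    return dict(graph)


# Word "A" occurs in two related word sets, so it becomes a hub linked to "B" and "C".
graph = build_related_word_graph([("A", "B"), ("A", "C")])
# graph == {"A": {"B", "C"}, "B": {"A"}, "C": {"A"}}
```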

(First sampling process)
Next, the first sampling process will be described with reference to FIG. 4.
Here, the control unit 21 of the management system 20 executes a variable setting process (step S2-1). Specifically, the preprocessing unit 211 of the control unit 21 acquires the initially set number of word sets m0, the tolerance α of the word set appearance rate, and the number of sampled documents N.

Next, the control unit 21 of the management system 20 executes a process of extracting documents in the number of sampled documents (step S2-2). Assume here that the document set storage unit 22 holds a document set consisting of N0 documents in total. The preprocessing unit 211 of the control unit 21 extracts N documents from these N0 documents to generate a document group. Because this document group is used to determine the number of sampled documents N, it is called the sample document group.
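Drawing the sample document group in step S2-2 can be sketched with the standard library's `random.sample`. The text does not state how the N documents are selected in this step, so random sampling without replacement, the optional `seed` parameter, and the function name are assumptions.

```python
import random
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def draw_sample_document_group(
    documents: Sequence[T],
    n: int,
    seed: Optional[int] = None,  # hypothetical parameter, only for reproducible runs
) -> List[T]:
    """Pick n distinct documents out of the N0 documents in the document set."""
    if not 0 < n <= len(documents):
        raise ValueError("n must satisfy 0 < n <= number of documents (N0)")
    return random.Random(seed).sample(list(documents), n)


document_set = [f"doc{i}" for i in range(20)]  # N0 = 20
sample_group = draw_sample_document_group(document_set, n=5, seed=1)
```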

Next, as in step S1-4, the control unit 21 of the management system 20 executes a process of extracting nouns by morphological analysis (step S2-3).
Next, the control unit 21 of the management system 20 executes a process of generating word sets using the sample document group (step S2-4). Specifically, the preprocessing unit 211 of the control unit 21 generates word sets in the sub-document group.

Next, the control unit 21 of the management system 20 sequentially identifies each generated word set as the processing target (word set r) and repeats the following processing.
Here, the control unit 21 of the management system 20 executes a process of calculating the number of documents φ that contain each word set (step S2-5). Specifically, the preprocessing unit 211 of the control unit 21 counts the number φr of documents in the sample document group that contain the word set being processed.

Next, the control unit 21 of the management system 20 executes a process of calculating the appearance rate of each word set (step S2-6). Specifically, for the word set r, the preprocessing unit 211 of the control unit 21 calculates the appearance rate p(rk) by dividing the number of documents φr by the number of sampled documents N. The preprocessing unit 211 then temporarily records the appearance rate in the work memory 25 in association with the word set being processed.
The above processing is repeated for all word sets.

Next, the control unit 21 of the management system 20 executes a process of sorting the word sets by appearance rate (step S2-7). Specifically, the preprocessing unit 211 of the control unit 21 sorts the word sets temporarily recorded in the work memory 25 in descending order of appearance rate.
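Steps S2-5 to S2-7 (count the documents containing each word set, divide by N, sort in descending order) can be sketched as follows. The input format, one collection of word sets per document, and the function name `ranked_document_rates` are assumptions; the alphabetical tie-break is also only illustrative.

```python
from collections import Counter
from typing import Iterable, List, Sequence, Tuple

WordSet = Tuple[str, str]


def ranked_document_rates(
    word_sets_per_document: Sequence[Iterable[WordSet]],
) -> List[Tuple[WordSet, float]]:
    """Return (word set, phi_r / N) pairs sorted by descending appearance rate."""
    n = len(word_sets_per_document)  # N: number of sampled documents
    if n == 0:
        return []
    doc_counts: Counter = Counter()
    for word_sets in word_sets_per_document:
        # set() ensures a document is counted once per word set, however often it repeats.
        doc_counts.update(set(word_sets))
    rates = {word_set: phi_r / n for word_set, phi_r in doc_counts.items()}
    return sorted(rates.items(), key=lambda item: (-item[1], item[0]))


docs = [
    [("a", "b"), ("a", "c")],
    [("a", "b")],
    [("b", "c")],
    [("a", "b"), ("b", "c")],
]
ranking = ranked_document_rates(docs)
# ranking == [(("a", "b"), 0.75), (("b", "c"), 0.5), (("a", "c"), 0.25)]
```

Note that this rate is document-based (φr over N), unlike the occurrence-based rate of step S1-6.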

Next, the control unit 21 of the management system 20 executes a process of calculating the number of word sets to be extracted from the sample document group (step S2-8). Specifically, the preprocessing unit 211 of the control unit 21 calculates the number of word sets k to be extracted from the sample document group using Equation 1 below.

Next, the control unit 21 of the management system 20 executes a process of identifying the appearance rate of the k-th word set (step S2-9). Specifically, the preprocessing unit 211 of the control unit 21 identifies the appearance rate of the k-th word set among the word sets sorted in descending order of appearance rate in the work memory 25.

Next, the control unit 21 of the management system 20 executes a process of determining whether the number of sampled documents is appropriate (step S2-10). Specifically, the preprocessing unit 211 of the control unit 21 judges the appropriateness using the appearance rate p(rk), the tolerance α of the word set appearance rate, the total number of documents N0, and the number of sampled documents N. If Equation 2 below holds, the number of sampled documents N is judged to be appropriate.

Equation 2 can be derived from the following calculation. First, let the true appearance rate of the word set rk be tp(rk), and let the appearance rate of the word set rk calculated by the preprocessing unit 211 be p(rk).
The true appearance rate tp(rk) is unknown. However, assuming that tp(rk) and p(rk) are equal, for the group of documents that were not sampled (ΔN = N0 − N), the distribution of the number of occurrences of the word set rk, and of the probability of each such number, strictly follows the binomial distribution Bin(ΔN, p(rk)). When ΔN is large, this can be approximated by the following normal distribution.
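For reference, the textbook normal approximation to this binomial distribution is written out below. It is the standard form, given under the assumption that the document relies on it; the symbol X (the number of unsampled documents containing the word set r_k) is introduced here only for illustration, and the document's own expression may be stated differently.

```latex
X \sim \mathrm{Bin}\bigl(\Delta N,\; p(r_k)\bigr)
\;\approx\;
\mathcal{N}\Bigl(\Delta N\, p(r_k),\;\; \Delta N\, p(r_k)\,\bigl(1 - p(r_k)\bigr)\Bigr),
\qquad
\frac{X}{\Delta N}
\;\approx\;
\mathcal{N}\!\left(p(r_k),\;\; \frac{p(r_k)\,\bigl(1 - p(r_k)\bigr)}{\Delta N}\right)
```

Under this approximation the standard deviation of the rate X/ΔN is the square root of p(r_k)(1 − p(r_k))/ΔN, and about 95% of the probability mass lies within roughly two (more precisely 1.96) standard deviations of the mean.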

Specifically, when the probability that the absolute value of the difference between the true appearance rate tp(rk) and the appearance rate p(rk) is less than the tolerance α equals the probability at which sampling can be regarded as statistically valid (95%), the tolerance α of the word set appearance rate is less than twice the standard deviation, and Equation 3 below holds.

If the number of sampled documents is judged not to be appropriate ("NO" in step S2-10), the control unit 21 of the management system 20 executes a process of changing the number of sampled documents (step S2-11). Specifically, the preprocessing unit 211 of the control unit 21 changes the number of sampled documents using Equation 4 below.

Here, [x] denotes the largest integer not exceeding x.
The control unit 21 of the management system 20 then returns to the process of step S2-2.
On the other hand, when it determines that the number of sampled documents is appropriate ("YES" in step S2-10), the control unit 21 of the management system 20 executes the document set division process (step S2-12). Specifically, the preprocessing unit 211 of the control unit 21 randomly extracts documents from the document set, in groups of the sampled document count N determined to be appropriate, and generates a plurality of sub-document groups.
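The division process of step S2-12 could be sketched as follows. This is a minimal illustration, not the patented implementation: the function name is hypothetical, and it assumes that the sub-document groups are disjoint and that a final, smaller group is kept when the document count is not a multiple of N.

```python
import random


def split_into_sub_document_groups(documents, sample_size, seed=None):
    """Randomly partition documents into groups of at most sample_size."""
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    rng = random.Random(seed)  # a seed makes the split reproducible
    shuffled = list(documents)
    rng.shuffle(shuffled)      # random extraction order
    # Consecutive slices of the shuffled list form the sub-document groups.
    return [
        shuffled[start:start + sample_size]
        for start in range(0, len(shuffled), sample_size)
    ]
```

Because each group is processed on its own, only about N documents need to be held in working memory at a time.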

As described above, this embodiment provides the following effects.
(1) According to this embodiment, the control unit 21 of the management system 20 executes the process of deleting unnecessary portions (step S1-2). This prevents noise caused by unnecessary information in the document set from entering the related word dictionary.

(2) According to this embodiment, the control unit 21 of the management system 20 executes the sub-document group extraction process (step S1-3). Dividing the document set into sub-document groups, each composed of a statistically significant number of documents, allows an enormous document set to be processed in smaller portions. This reduces the memory capacity required for a single processing run.

(3) According to this embodiment, the noun extraction process by morphological analysis (step S1-4) and the word set creation process (step S1-5) are executed. This generates, in each sub-document group, word sets that are candidates for related word sets.

(4) According to this embodiment, the control unit 21 of the management system 20 executes the word-set appearance rate calculation process (step S1-6), the process of extracting word sets according to the appearance rate (step S1-7), and the related word set aggregation process (step S1-8). This allows the word sets to be used in the related word dictionary to be extracted from the sub-document groups according to the appearance rate.
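Steps S1-5 to S1-7 can be pictured with the following minimal sketch. It assumes that morphological analysis has already reduced each document to a list of nouns, that a word set is an unordered pair of distinct nouns appearing in the same document, and that extraction keeps the top-ranked pairs; these choices and all names are illustrative assumptions rather than details confirmed by the source.

```python
from collections import Counter
from itertools import combinations


def appearance_rates(sub_document_group):
    """Map each noun pair to the fraction of documents containing both nouns.

    sub_document_group -- list of noun lists, one list per document
    """
    doc_counts = Counter()
    for nouns in sub_document_group:
        unique = sorted(set(nouns))  # count each pair once per document
        doc_counts.update(combinations(unique, 2))
    n_docs = len(sub_document_group)
    return {pair: count / n_docs for pair, count in doc_counts.items()}


def select_word_sets(rates, top_k):
    """Keep the top_k pairs by appearance rate (ties broken alphabetically)."""
    ranked = sorted(rates.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_k]
```

For example, with the documents `[["bank", "loan", "rate"], ["bank", "loan"], ["rate", "bond"]]`, the pair `("bank", "loan")` appears in two of three documents and therefore ranks first.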

(5) According to this embodiment, the control unit 21 of the management system 20 executes the graph creation process (step S1-9). This allows the relationships among mutually related words to be displayed visually as a graph.
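As a rough illustration of the data behind such a graph, the sketch below builds an adjacency list in which each word is a node linked to the other word of every related word set it belongs to. The function name and the pair representation are assumptions, and rendering the graph is left out.

```python
from collections import defaultdict


def build_relation_graph(related_word_sets):
    """Return {word: sorted list of linked words} from an iterable of pairs."""
    graph = defaultdict(set)
    for first, second in related_word_sets:
        graph[first].add(second)  # link in both directions: the relation
        graph[second].add(first)  # is treated as symmetric here
    return {word: sorted(neighbours) for word, neighbours in graph.items()}
```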

(6) According to this embodiment, the control unit 21 of the management system 20 executes the process of calculating the number of word sets to be extracted from the sampled document group (step S2-8), the process of identifying the appearance rate of the k-th word set (step S2-9), and the process of determining whether the number of sampled documents is appropriate (step S2-10). This makes it possible to judge the validity of the number of sampled documents. For example, if the number of sampled documents is too small, word sets representative of the document set cannot be extracted from the sub-document groups. Conversely, if the number of sampled documents is too large, representative word sets can be extracted, but the memory capacity required for processing increases. The validity determination thus makes it possible to generate sub-document groups that reflect the characteristics of the document set.

(7) According to this embodiment, when it determines that the number of sampled documents is not appropriate ("NO" in step S2-10), the control unit 21 of the management system 20 executes the process of changing the number of sampled documents (step S2-11). This allows the number of sampled documents to be adjusted.

This embodiment can be modified and implemented as follows. This embodiment and the following modifications can be combined with one another to the extent that they are technically consistent.
- In the above embodiment, the related word dictionary is created using a plurality of articles recorded in the encyclopedia database. The source material for the related word dictionary is not limited to articles. For example, information published on the Internet can be used.
- In the above embodiment, related word sets are generated using nouns as the specific part of speech, but the part of speech is not limited to nouns. Related word sets can be generated from any combination of parts of speech, for example a combination of a part of speech other than nouns, such as adjectives with adjectives, or a combination of different parts of speech, such as a noun and a verb.

- In the above embodiment, the number of sampled documents is determined in the first sampling process. The method for determining the number of sampled documents is not limited to this. For example, the initial value of the number of sampled documents may be changed based on the size of the work memory 25.

The distribution of the number of words per document can be examined with a computational cost on the order of the total number of documents N0. Therefore, first, for the words included in each document, the average number of word sets m0 and the standard deviation σm are calculated. In addition, the average number of words m per document is calculated for the group of documents that was sampled.

When the following condition is satisfied, the sampling can be judged to be appropriate.

The initial value of the number of sampled documents may also be changed based on document attributes. In this case, the length distribution of the words included in the document set may be used as the document attribute, and the upper limit of N is determined using Equation 5 below.

- In the above embodiment, the control unit 21 of the management system 20 executes the noun extraction process by morphological analysis (step S1-4). When the related word dictionary is updated, past calculation results may be retained and reused when new documents are added. In this case, among the appearance rates calculated in the dictionary creation process, the lowest appearance rate of any word set registered in the dictionary as a related word set (the reference probability) is stored in the management system 20. After the related word dictionary has been created, documents are accumulated until their number reaches the sampled document count N. To add the accumulated document group to the dictionary, co-occurrence analysis is performed on the document group and the appearance rate of each word set is calculated. A word set whose appearance rate is equal to or higher than the reference probability is additionally registered as a related word set. This reduces the computational load compared with a full revision in which the dictionary creation process is re-executed from the beginning.
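The incremental update described in this modification might look like the following sketch. The dictionary is modelled as a mapping from word sets to appearance rates, and entries already registered are left unchanged; both are assumptions made for illustration, as are the function names.

```python
def reference_probability(dictionary_rates):
    """Lowest appearance rate among the registered related word sets."""
    if not dictionary_rates:
        raise ValueError("the dictionary holds no related word sets")
    return min(dictionary_rates.values())


def add_from_new_batch(dictionary_rates, batch_rates):
    """Register word sets from a new batch that reach the reference probability.

    dictionary_rates -- {word_set: rate} for already registered word sets
    batch_rates      -- {word_set: rate} computed on the accumulated documents
    """
    threshold = reference_probability(dictionary_rates)
    updated = dict(dictionary_rates)  # leave the caller's mapping untouched
    for word_set, rate in batch_rates.items():
        if word_set not in updated and rate >= threshold:
            updated[word_set] = rate
    return updated
```

Only the new batch is analysed, so the cost of an update depends on N rather than on the size of the full document set.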

- In the above embodiment, the control unit 21 of the management system 20 executes the related word set aggregation process (step S1-8). In this case, the related word sets to be recorded in the dictionary storage unit 24 are specified using the appearance rates temporarily stored in the work memory 25. When the same related word set is registered more than once, only the entry associated with the highest appearance rate is kept. The aggregation method is not limited to this. When a related word set is registered more than once, the average of its appearance rates, or the entry associated with the lowest appearance rate, may be used instead.
Further, the appearance index value used in the aggregation process is not limited to the appearance rate; any index representing how a word set appears can be used. For example, the related word sets to be recorded in the dictionary storage unit 24 may be specified according to the number of occurrences of each word set. In this case, in the second sampling, the control unit 21 temporarily stores each extracted related word set in the work memory 25 in association with its number of occurrences in the sub-document group. Next, in the aggregation process, when the control unit 21 detects the same word set more than once in the work memory 25, it sums the numbers of occurrences of that word set. The control unit 21 then sorts the related word sets by their total number of occurrences, specifies m0 related word sets in descending order of the number of occurrences, and records them in the dictionary storage unit 24.
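The aggregation variants described in this modification can be sketched as below. Entries are modelled as (word set, value) tuples collected from all sub-document groups; this data layout, the function names, and the alphabetical tie-breaking are illustrative assumptions.

```python
from collections import defaultdict


def aggregate_rates(entries, mode="max"):
    """Merge duplicate word sets by the highest, lowest, or mean rate."""
    reducers = {
        "max": max,
        "min": min,
        "mean": lambda values: sum(values) / len(values),
    }
    reduce_values = reducers[mode]  # raises KeyError for an unknown mode
    grouped = defaultdict(list)
    for word_set, rate in entries:
        grouped[word_set].append(rate)
    return {ws: reduce_values(values) for ws, values in grouped.items()}


def top_by_total_count(entries, m0):
    """Sum occurrence counts per word set and keep the m0 most frequent."""
    totals = defaultdict(int)
    for word_set, count in entries:
        totals[word_set] += count
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:m0]
```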

20 ... management system, 21 ... control unit, 211 ... preprocessing unit, 212 ... index generation unit, 213 ... co-occurrence analysis unit, 214 ... data analysis unit, 215 ... output processing unit, 22 ... document set storage unit, 23 ... sub-document storage unit, 24 ... dictionary storage unit, 25 ... work memory.

Claims

A related word dictionary creation system comprising a control unit that identifies related word sets in a document set consisting of a plurality of documents, and
a dictionary storage unit that records the related word sets,
wherein the control unit:
extracts documents from the document set, in the number given by a sampled document count, to generate a plurality of sub-document groups;
in each of the sub-document groups, generates word sets using the words included in the sub-document group, and calculates an appearance rate according to the number of documents in which each word set appears; and
in each of the sub-document groups, records each word set specified according to the appearance rate in the dictionary storage unit as a related word set.

The related word dictionary creation system according to claim 1, wherein the control unit generates a word set for each sub-document group by co-occurrence analysis.

The related word dictionary creation system according to claim 1 or 2, wherein the control unit identifies the related word sets to be extracted from each sub-document group by using the total number of related word sets recorded in the dictionary storage unit, the number of documents constituting the document set, and the number of sampled documents.

The related word dictionary creation system according to any one of claims 1 to 3, wherein the control unit calculates the appearance rate of the word sets in a randomly selected sub-document group, and determines whether the number of sampled documents is statistically significant by using the standard deviation and the tolerance of the appearance rate.

The related word dictionary creation system according to claim 4, wherein, when the control unit determines that the number of sampled documents is not statistically significant, the control unit changes the number of sampled documents by using the appearance rate of the word set, the number of sub-documents, and the tolerance.

The related word dictionary creation system according to any one of claims 1 to 5, wherein the control unit deletes unnecessary parts in the document set.

The related word dictionary creation system according to any one of claims 1 to 6, wherein the control unit generates a graph that displays relevance by using each word constituting a related word set recorded in the dictionary storage unit as a node and linking it to the other words constituting that related word set.

A method for creating a related word dictionary by using a related word dictionary creation system comprising a control unit that identifies related word sets in a document set consisting of a plurality of documents, and
a dictionary storage unit that records the related word sets,
wherein the control unit:
extracts documents from the document set, in the number given by a sampled document count, to generate a plurality of sub-document groups;
in each of the sub-document groups, generates word sets using the words included in the sub-document group, and calculates an appearance rate according to the number of documents in which each word set appears; and
in each of the sub-document groups, records each word set specified according to the appearance rate in the dictionary storage unit as a related word set.

A program for creating a related word dictionary by using a related word dictionary creation system comprising a control unit that identifies related word sets in a document set consisting of a plurality of documents, and
a dictionary storage unit that records the related word sets,
the program causing the control unit to function as means for:
extracting documents from the document set, in the number given by a sampled document count, to generate a plurality of sub-document groups;
in each of the sub-document groups, generating word sets using the words included in the sub-document group, and calculating an appearance rate according to the number of documents in which each word set appears; and
in each of the sub-document groups, recording each word set specified according to the appearance rate in the dictionary storage unit as a related word set.