JP4128212B1

JP4128212B1 - Relevance calculation system between keywords and relevance calculation method

Info

Publication number: JP4128212B1
Application number: JP2007269839A
Authority: JP
Inventors: 修大島
Original assignee: Nomura Research Institute Ltd
Current assignee: Nomura Research Institute Ltd
Priority date: 2007-10-17
Filing date: 2007-10-17
Publication date: 2008-07-30
Anticipated expiration: 2027-10-17
Also published as: JP2009098931A

Abstract

[Problem] To realize a system that can efficiently calculate the degree of relevance between keywords of every kind, based on the co-occurrence between keywords.
[Solution] A system 10 for calculating relevance between keywords comprises a keyword extraction unit 14 that extracts keywords from a plurality of document files, and a relevance calculation unit 18 that calculates the degree of relevance between a pair of keywords for every combination of keywords, based on the appearance frequency of each keyword in each document file, and stores the result in a keyword relevance table DB 26. The relevance calculation unit 18 calculates, for each document file, the appearance frequency of each keyword that appears in it; calculates the square of each keyword's appearance frequency; totals these squares over all document files; calculates, for each document file, the product of the appearance frequencies of a pair of keywords; totals these products over all document files; calculates the square root of each keyword's sum of squares; adds the two square roots; and divides the sum of the products for the pair by that sum to obtain the degree of relevance.
[Selected drawing] Figure 1
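
The calculation described in the abstract can be written compactly. The notation below is introduced here for illustration and does not appear in the source: f_{a,d} is the appearance frequency of keyword a in document file d, D is the set of all document files, and R(a, b) is the degree of relevance between keywords a and b.

```latex
R(a, b) =
  \frac{\sum_{d \in D} f_{a,d}\, f_{b,d}}
       {\sqrt{\sum_{d \in D} f_{a,d}^{2}} + \sqrt{\sum_{d \in D} f_{b,d}^{2}}}
```

The denominator adds the two square roots. This differs from cosine similarity, whose denominator multiplies them.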

Description

The present invention relates to a system and a method for calculating the degree of relevance between keywords. In particular, it relates to a technique for calculating relevance between keywords that is indispensable for realizing associative search, in which terms closely related to an input search term are extracted in a chain and companies, products, people, and the like closely related to the extracted terms are presented.

Search systems are used to extract needed information from vast amounts of information. A typical search system has a mechanism for extracting information that contains a concept identical or similar to the input search term. For example, given the search term "Fuji" against a database that stores information on many companies, the search system can accurately output a list of companies whose names contain the character string "Fuji". Likewise, entering "environmental problems" on an Internet search site displays a list of Web pages that contain the character string "environmental problems".
As a result, the user can reach the target information, but the search results stay within the expected range, and looking over the result list cannot be expected to yield surprising discoveries. New knowledge can of course be gained while examining the details of individual items in the result list, but information containing other terms closely related to the search term could not be extracted directly.

In this regard, the "associative search system" disclosed in Patent Document 1 includes related-term storage means that stores the related terms of each term, and co-occurring company name storage means that stores company names having high co-occurrence with each term (that is, a high probability of appearing in the same document). When a search term is input, the system extracts the terms related to it and then extracts the company names with high co-occurrence for each of those terms.
Patent Document 1: JP 2004-110386 A

As a result, when a user enters "environmental problems" as a search term, the names of companies that frequently appear in documents on environmental problems can be listed directly, so that the user can identify companies that are actively addressing environmental problems and act on that in investment decisions.

However, in this associative search system the targets of associative search are limited to company names (including related company names), so the system has the problem that it has no practical use other than searching for companies to invest in.

That is, this conventional associative search system extracts keywords from text information such as securities reports and newspaper articles, and then refers to a company information storage unit to determine whether each keyword is a company name. If it is, the system stores that company name together with the other keywords that appeared in the same text information in the co-occurring company name storage means, treating them as related.

Consequently, when the search term "Internet" is entered, company names having strong co-occurrence with "Internet" can be picked up, but people, regions, technologies, and the like having strong co-occurrence with "Internet" cannot.
Patent Document 1 shows an example in which "broadband", "network", and "e-mail" are returned for the input "Internet", but these were prepared in advance in the related-term storage means as a thesaurus for "Internet"; they are not extracted on the basis of co-occurrence with "Internet".

The present invention was devised to solve the above problem. Its object is to realize a system that can efficiently calculate the degree of relevance between a pair of keywords, based on the co-occurrence between keywords of every kind, including company names.

To achieve the above object, the system for calculating relevance between keywords according to claim 1 comprises: document storage means storing a plurality of document files; keyword extraction means that extracts a plurality of keywords from each of the document files and stores them in keyword storage means; and relevance calculation means that calculates, based on the appearance frequency of each keyword in each document file, the degree of relevance between a pair of keywords for all combinations of keywords and stores it in keyword relevance storage means. The relevance calculation means executes:
(1) a process of detecting, for each document file, the keywords that appear in that document file and calculating their appearance frequencies;
(2) a process of calculating the square of the appearance frequency of each keyword;
(3) a process of totaling the squares of the appearance frequency of each keyword to obtain the sum over all document files;
(4) a process of calculating, for each document file and for each pair of keywords, the product of the two keywords' appearance frequencies as the product value of the appearance frequencies between the pair;
(5) a process of totaling the product values of the appearance frequencies between each pair of keywords to obtain the sum over all document files;
(6) a process of calculating the square root of the sum obtained in (3); and
(7) a process of adding together the square roots from (6) for a pair of keywords and dividing the sum from (5) by the result, thereby calculating the degree of relevance between the two keywords.
The order of processes (1) to (7) may be changed as appropriate, provided that no logical contradiction arises.
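
Processes (1) to (7) of claim 1 can be sketched on a single machine as follows. This is a minimal illustration, not the patented implementation: the function names, the dictionary-based data layout, and the naive substring counting in `count_keywords` are all assumptions made here, since this section does not say how keywords are detected or how the tables are stored.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def count_keywords(text, keywords):
    """Step (1), assumed form: count non-overlapping substring occurrences."""
    counts = {k: text.count(k) for k in keywords}
    return {k: n for k, n in counts.items() if n > 0}


def relevance_table(doc_keyword_counts):
    """Compute pairwise relevance from per-document keyword frequencies.

    doc_keyword_counts: iterable of dicts, one per document file,
    each mapping keyword -> appearance frequency in that file.
    Returns a dict mapping a sorted (keyword, keyword) tuple to its relevance.
    """
    square_sums = defaultdict(int)   # steps (2) and (3)
    product_sums = defaultdict(int)  # steps (4) and (5)

    for counts in doc_keyword_counts:
        present = {k: f for k, f in counts.items() if f > 0}
        for keyword, freq in present.items():
            square_sums[keyword] += freq * freq
        for a, b in combinations(sorted(present), 2):
            product_sums[(a, b)] += present[a] * present[b]

    roots = {k: sqrt(s) for k, s in square_sums.items()}  # step (6)
    return {
        (a, b): total / (roots[a] + roots[b])             # step (7)
        for (a, b), total in product_sums.items()
    }


docs = [
    {"internet": 3, "broadband": 4},
    {"internet": 4, "broadband": 3, "network": 1},
]
table = relevance_table(docs)
print(table[("broadband", "internet")])  # 24 / (5 + 5) = 2.4
```

Only pairs that co-occur in at least one document file receive an entry; in this sketch a pair that never co-occurs is simply absent, which corresponds to a product sum of zero.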

The system for calculating relevance between keywords according to claim 2 comprises a management server, a plurality of first distributed processing servers, and a second distributed processing server.
The management server comprises: means for distributing the plurality of document files stored in document storage means among the first distributed processing servers; means for storing the keywords transmitted from each first distributed processing server in keyword storage means; means for transmitting all keywords stored in the keyword storage means to each first distributed processing server; means for transmitting the plurality of appearance-frequency square value files transmitted from the first distributed processing servers to the second distributed processing server; means for delivering the plural types of combination-frequency product value files transmitted from the first distributed processing servers, sorted by type, to the first distributed processing server responsible for each type; means for storing, in keyword frequency sum table storage means, the sum over all document files of the squares of each keyword's appearance frequency, transmitted from the second distributed processing server; means for storing, in keyword combination frequency sum table storage means, the sum over all document files of the product values of the appearance frequencies between keywords, transmitted from each first distributed processing server; means for retrieving a pair of keywords from the keyword storage means; means for retrieving, from the keyword combination frequency sum table storage means, the sum of the product values of the appearance frequencies for that pair; means for retrieving, from the keyword frequency sum table storage means, the sum of the squares of the appearance frequency of each keyword of the pair; and means for calculating the square root of each of these sums, adding the two square roots, and dividing the sum of the product values by the result, thereby calculating the degree of relevance between the two keywords.
Each first distributed processing server comprises: keyword extraction means that extracts keywords from the document files assigned to it by the management server; means for transmitting each keyword to the management server; means for detecting, when all keywords are transmitted from the management server, the presence or absence of each keyword in each assigned document file; means for calculating the square of the appearance frequency of each keyword that appears and writing it to an appearance-frequency square value file for each document file; means for generating, from each pair of keywords that appear, a keyword combination in which the keyword whose first character has the lower character code is placed first; means for calculating, for each combination, the product of the two keywords' appearance frequencies as the product value of the appearance frequencies between the pair; means for comparing the character code of the first character of the first keyword with the character code ranges assigned in advance to a plurality of combination-frequency product value files, to identify the file to which the product value should be written; means for writing the product value to the corresponding combination-frequency product value file for each document file; means for transmitting the appearance-frequency square value file and the plural types of combination-frequency product value files to the management server; means for concatenating, when a plurality of combination-frequency product value files of the same type are transmitted from the management server, those files; means for sorting the keyword combinations written in the concatenated file according to the character codes of the keywords; means for totaling the product values for each identical keyword combination to obtain the sum over all document files; and means for transmitting this sum to the management server.
The second distributed processing server comprises: means for concatenating, when a plurality of appearance-frequency square value files are transmitted from the management server, those files; means for sorting the keywords written in the concatenated file according to their character codes; means for totaling the squares of the appearance frequency for each identical keyword to obtain the sum over all document files; and means for transmitting this sum to the management server.
The system of claim 2 is characterized in that keyword extraction, generation of the keyword appearance-frequency square value files, generation of the appearance-frequency product value files between keywords, and calculation of the sums of appearance-frequency product values over all documents are distributed among a plurality of distributed processing servers, while calculation of the sums of appearance-frequency squares over all documents is handled by a single distributed processing server.
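
The pairing and routing steps performed by each first distributed processing server might look like the following sketch. The three character code ranges, the file labels, the record layout, and the function names are hypothetical; the source states only that each combination-frequency product value file is assigned a character code range in advance. The source orders a pair by the character code of the first character; comparing the whole strings, as done here, is an assumed tie-break for keywords that share a first character.

```python
from itertools import combinations

# Hypothetical assignment of first-character code ranges to product value files.
RANGES = [
    ("products_0", 0x0000, 0x007F),    # ASCII
    ("products_1", 0x0080, 0x30FF),    # up to and including kana
    ("products_2", 0x3100, 0x10FFFF),  # everything else, including kanji
]


def ordered_pair(a, b):
    """Place first the keyword with the lower character code.

    Python compares strings code point by code point, so the first
    character decides unless it is identical in both keywords.
    """
    return (a, b) if a <= b else (b, a)


def target_file(first_keyword):
    """Pick the product value file whose range covers the first character."""
    code = ord(first_keyword[0])
    for label, low, high in RANGES:
        if low <= code <= high:
            return label
    raise ValueError(f"no file covers character code {code:#x}")


def route_products(doc_id, counts):
    """Return {file label: [(doc_id, first, second, product), ...]} for one document."""
    routed = {}
    for a, b in combinations(counts, 2):
        first, second = ordered_pair(a, b)
        record = (doc_id, first, second, counts[first] * counts[second])
        routed.setdefault(target_file(first), []).append(record)
    return routed


routed = route_products("doc1", {"環境": 2, "internet": 3, "ネットワーク": 1})
for label, records in routed.items():
    print(label, records)
```

Because every server orders each pair the same way and routes it by the first keyword alone, all records for a given combination end up in files of the same type, so one server can total them without consulting the others.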

The system for calculating relevance between keywords according to claim 3 comprises a management server and a plurality of distributed processing servers.
The management server comprises: means for distributing the plurality of document files stored in document storage means among a plurality of first distributed processing servers, which consist of at least some of the plurality of distributed processing servers; means for transmitting the plurality of keywords stored in keyword storage means to each first distributed processing server; means for delivering the plural types of combination-frequency product value files transmitted from the first distributed processing servers to a plurality of second distributed processing servers, which consist of at least some of the plurality of distributed processing servers, sorted by type according to each server's assignment; means for storing, in keyword combination frequency sum table storage means, the sum over all document files of the product values of the appearance frequencies between keywords, transmitted from each second distributed processing server; means for distributing the plurality of document files stored in the document storage means among a plurality of third distributed processing servers, which consist of at least some of the plurality of distributed processing servers; means for transmitting the plurality of keywords stored in the keyword storage means to each third distributed processing server; means for delivering the plural types of appearance-frequency square value files transmitted from the third distributed processing servers to a plurality of fourth distributed processing servers, which consist of at least some of the plurality of distributed processing servers, sorted by type according to each server's assignment; means for storing, in keyword frequency sum table storage means, the sum over all document files of the squares of each keyword's appearance frequency, transmitted from each fourth distributed processing server; means for retrieving a pair of keywords from the keyword storage means; means for retrieving, from the keyword combination frequency sum table storage means, the sum of the product values of the appearance frequencies for that pair; means for retrieving, from the keyword frequency sum table storage means, the sum of the squares of the appearance frequency of each keyword of the pair; and means for calculating the square root of each of these sums, adding the two square roots, and dividing the sum of the product values by the result, thereby calculating the degree of relevance between the two keywords.
Each first distributed processing server comprises: means for detecting the presence or absence of each keyword in each document file assigned to it by the management server; means for generating, from each pair of keywords that appear, a keyword combination in which the keyword whose first character has the lower character code is placed first; means for calculating, for each combination, the product of the two keywords' appearance frequencies as the product value of the appearance frequencies between the pair; means for comparing the character code of the first character of the first keyword with the character code ranges assigned in advance to a plurality of combination-frequency product value files, to identify the file to which the product value should be written; means for writing the product value to the corresponding combination-frequency product value file for each document file; and means for transmitting these plural types of combination-frequency product value files to the management server.
Each second distributed processing server comprises: means for concatenating, when a plurality of combination-frequency product value files of the same type are transmitted from the management server, those files; means for sorting the keyword combinations written in the concatenated file according to the character codes of the keywords; means for totaling the product values for each identical keyword combination to obtain the sum over all document files; and means for transmitting this sum to the management server.
Each third distributed processing server comprises: means for detecting the presence or absence of each keyword in each document file assigned to it by the management server; means for calculating the square of the appearance frequency of each keyword that appears; means for comparing the character code of each keyword with the character code ranges assigned in advance to a plurality of appearance-frequency square value files, to identify the file to which the square should be written; means for writing the square to the corresponding appearance-frequency square value file for each document file; and means for transmitting these plural types of appearance-frequency square value files to the management server.
Each fourth distributed processing server comprises: means for concatenating, when a plurality of appearance-frequency square value files of the same type are transmitted from the management server, those files; means for sorting the keywords written in the concatenated file according to their character codes; means for totaling the squares of the appearance frequency for each identical keyword to obtain the sum over all document files; and means for transmitting this sum to the management server.
The system of claim 3 is characterized in that generation of the appearance-frequency product value files between keywords, calculation of the sums of appearance-frequency product values over all documents, generation of the keyword appearance-frequency square value files, and calculation of the sums of appearance-frequency squares over all documents are each distributed among a plurality of distributed processing servers.
The first to fourth distributed processing servers are logical divisions based on function. They may be physically independent machines, but they may also overlap. Where they physically overlap, files need only be sent and received between servers to the extent necessary. For example, if a first distributed processing server and a third distributed processing server are implemented on a common server machine, the management server need only deliver the assigned document files to the first distributed processing server and can omit delivering the same document files again to the third distributed processing server (the same applies to the inventions of claims 4 to 6).
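
The aggregation performed by the second and fourth distributed processing servers (concatenate the received files, sort by keyword or keyword combination, then total per identical key) could be sketched as follows. The in-memory record layout and the function name `aggregate` are assumptions made for illustration; the source does not specify a file format in this section.

```python
from itertools import chain, groupby
from operator import itemgetter


def aggregate(files):
    """Concatenate record lists, sort by key, and sum the values per identical key.

    Each file is a list of (key, value) records. For square value files the
    key is a keyword; for product value files it is a (first, second) tuple.
    Sorting strings or tuples of strings in Python follows code point order.
    """
    merged = sorted(chain.from_iterable(files), key=itemgetter(0))
    return {
        key: sum(value for _, value in group)
        for key, group in groupby(merged, key=itemgetter(0))
    }


# Square value files of one type, as received from two upstream servers.
square_files = [
    [("internet", 9), ("broadband", 16)],
    [("internet", 16), ("broadband", 9), ("network", 1)],
]
square_sums = aggregate(square_files)

# Product value files of one type, as received from two upstream servers.
product_files = [
    [(("broadband", "internet"), 12)],
    [(("broadband", "internet"), 12), (("broadband", "network"), 3)],
]
product_sums = aggregate(product_files)

print(square_sums)
print(product_sums)
```

Sorting first is what lets a single pass total each key: `groupby` only merges adjacent records, so identical keys must be contiguous before grouping.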

請求項４に記載したキーワード間の関連度算出システムは、管理サーバと、複数の分散処理サーバとを備えたキーワード間の関連度算出システムであって、上記管理サーバが、文書記憶手段に格納された複数の文書ファイルを、上記複数の分散処理サーバの中の少なくとも一部からなる複数の第１の分散処理サーバに分配する手段と、キーワード記憶手段に格納された複数のキーワードを、第１の分散処理サーバに対してそれぞれ送信する手段と、各第１の分散処理サーバから送信された組合せ頻度積値ファイルを、上記複数の分散処理サーバの中の一つである第２の分散処理サーバに送信する手段と、第２の分散処理サーバから送信された、各キーワード間の出現頻度の積値の全文書ファイルに亘る総和を、キーワード組合せ頻度総和表記憶手段に格納する手段と、上記文書記憶手段に格納された複数の文書ファイルを、上記複数の分散処理サーバの中の少なくとも一部からなる複数の第３の分散処理サーバに分配する手段と、上記キーワード記憶手段に格納された複数のキーワードを、第３の分散処理サーバに対してそれぞれ送信する手段と、各第３の分散処理サーバから送信された出現頻度二乗値ファイルを、上記複数の分散処理サーバの中の一つである第４の分散処理サーバに送信する手段と、第４の分散処理サーバから送信された、各キーワードの出現頻度の二乗値の全文書ファイルに亘る総和を、キーワード頻度総和表記憶手段に格納する手段と、上記キーワード記憶手段から一対のキーワードを取り出す手段と、上記キーワード組合せ頻度総和表記憶手段から、上記一対のキーワードについて、各キーワード間の出現頻度の積値の総和を取り出す手段と、上記キーワード頻度総和表記憶手段から、上記一対のキーワードについて、各キーワードの出現頻度の二乗値の総和を取り出す手段と、この総和の平方根をそれぞれ算出すると共に、両平方根を加算し、この和でキーワード間の出現頻度の積値の総和を除することにより、両キーワード間の関連度を算出する手段とを備え、上記の各第１の分散処理サーバが、管理サーバによって分配された担当文書ファイルについて、各キーワードの有無を文書ファイル毎に探知する手段と、出現実績のある一対のキーワード間で、先頭文字の文字コードが若い方を１番目に配置させたキーワードの組合せを生成する手段と、各組合せ毎に、一対のキーワードについて、それぞれのキーワードの出現頻度の積を、一対のキーワード間の出現頻度の積値として算出する手段と、上記積値を、組合せ頻度積値ファイルに文書ファイル毎に記述する手段と、この組合せ頻度積値ファイルを管理サーバに送信する手段とを備え、
上記第２の分散処理サーバが、管理サーバから複数の組合せ頻度積値ファイルが送信された場合に、各組合せ頻度積値ファイルを連結する手段と、この連結ファイルに記述されたキーワードの組合せを、各キーワードの文字コードに応じてソートする手段と、同一キーワードの組合せ単位で積値を集計し、全文書ファイルに亘る総和を算出する手段と、この総和を管理サーバに送信する手段とを備え、上記の各第３の分散処理サーバが、管理サーバによって分配された担当文書ファイルについて、各キーワードの有無を文書ファイル毎に探知する手段と、出現実績のあるキーワードの出現頻度の二乗値を算出する手段と、上記二乗値を、出現頻度二乗値ファイルに文書ファイル毎に記述する手段と、この出現頻度二乗値ファイルを管理サーバに送信する手段とを備え、上記第４の分散処理サーバが、管理サーバから複数の出現頻度二乗値ファイルが送信された場合に、各出現頻度二乗値ファイルを連結する手段と、この連結ファイルに記述されたキーワードを、それぞれの文字コードに応じてソートする手段と、同一キーワード単位で出現頻度の二乗値を集計し、全文書ファイルに亘る総和を算出する手段と、この総和を管理サーバに送信する手段とを備えたことを特徴としている。
この請求項４のシステムは、キーワード間の出現頻度積値ファイル生成処理及びキーワードの出現頻度二乗値ファイル生成処理が複数の分散処理サーバによって分散処理され、出現頻度積値の全文書に亘る総和算出処理及び出現頻度二乗値の全文書に亘る総和算出処理が単独の分散処理サーバによって処理される点に特徴を備えている。 The inter-keyword relevance calculation system according to claim 4 is a keyword relevance calculation system comprising a management server and a plurality of distributed processing servers, wherein the management server is stored in the document storage means. Means for distributing the plurality of document files to a plurality of first distributed processing servers comprising at least a part of the plurality of distributed processing servers, and a plurality of keywords stored in the keyword storage means. The means for transmitting to each of the distributed processing servers and the combined frequency product value file transmitted from each of the first distributed processing servers are sent to the second distributed processing server that is one of the plurality of distributed processing servers. The sum of the product values of the appearance frequencies between the keywords transmitted from the second distributed processing server and all document files is displayed as a keyword combination frequency sum total. Means for storing in the means, means for distributing the plurality of document files stored in the document storage means to a plurality of third distributed processing servers comprising at least a part of the plurality of distributed processing servers, Means for transmitting a plurality of keywords stored in the keyword storage means to each of the third distributed processing servers, and an appearance frequency square value file transmitted from each of the third distributed processing servers. 
A means for transmitting to a fourth distributed processing server that is one of the servers; means for storing, in the keyword frequency sum table storage means, the sums over all document files of the squared appearance frequency of each keyword, as received from the fourth distributed processing server; means for retrieving a pair of keywords from the keyword storage means; means for retrieving, from the keyword combination frequency sum table storage means, the sum of the appearance frequency product values for that pair; means for retrieving, from the keyword frequency sum table storage means, the sum of the squared appearance frequencies of each keyword of the pair; and means for calculating the relevance between the two keywords by taking the square root of each of these sums, adding the two square roots, and dividing the sum of the product values by the result.
Each first distributed processing server comprises: means for detecting, for each assigned document file distributed by the management server, whether each keyword is present; means for generating, for each pair of keywords that appear, a keyword combination in which the keyword whose first character has the lower character code is placed first; means for calculating, for each combination, the product of the appearance frequencies of the two keywords as the appearance frequency product value of the pair; means for recording the product value in a combination frequency product value file for each document file; and means for transmitting the combination frequency product value file to the management server.
The second distributed processing server comprises: means for concatenating the combination frequency product value files when multiple files are received from the management server; means for sorting the keyword combinations in the concatenated file by the character codes of the keywords; means for totaling the product values for each identical keyword combination to obtain the sum over all document files; and means for transmitting this sum to the management server.
Each third distributed processing server comprises: means for detecting, for each assigned document file distributed by the management server, whether each keyword is present; means for calculating the squared appearance frequency of each keyword that appears; means for recording the squared value in an appearance frequency square value file for each document file; and means for transmitting the appearance frequency square value file to the management server.
The fourth distributed processing server comprises: means for concatenating the appearance frequency square value files when multiple files are received from the management server; means for sorting the keywords in the concatenated file by their character codes; means for totaling the squared appearance frequencies for each identical keyword to obtain the sum over all document files; and means for transmitting this sum to the management server.
The system of claim 4 is characterized in that generation of the inter-keyword appearance frequency product value files and generation of the keyword appearance frequency square value files are each distributed across multiple distributed processing servers, while summation of the product values over all documents and summation of the squared values over all documents are each handled by a single distributed processing server.
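The concatenate, sort, and total sequence performed by a single aggregating server can be sketched as follows. This is only a minimal illustration under stated assumptions, not the patented implementation: the function name `aggregate_square_files`, the in-memory list-of-tuples stand-in for a file, and the sample values are all hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def aggregate_square_files(files):
    """Concatenate per-document records, sort by keyword, and total per keyword.

    Each "file" is a list of (keyword, squared_frequency) records, one record
    per keyword per document in which that keyword appeared (assumed format).
    """
    concatenated = [record for f in files for record in f]  # concatenate
    concatenated.sort(key=itemgetter(0))                    # sort by character code
    return {
        keyword: sum(value for _, value in group)           # total per keyword
        for keyword, group in groupby(concatenated, key=itemgetter(0))
    }


# Hypothetical files received from two distributed processing servers.
file_from_server_a = [("DVD", 9), ("recorder", 4)]
file_from_server_b = [("DVD", 1), ("camera", 16)]
totals = aggregate_square_files([file_from_server_a, file_from_server_b])
```

Sorting first means identical keywords become adjacent, so each total can be produced in a single pass over the concatenated records. The same shape applies to product value files if the key is a keyword pair instead of a single keyword.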

The inter-keyword relevance calculation system according to claim 5 comprises a management server and a plurality of distributed processing servers.
The management server comprises: means for distributing the document files stored in the document storage means to a plurality of first distributed processing servers, which are at least some of the distributed processing servers; means for transmitting the keywords stored in the keyword storage means to each first distributed processing server; means for routing the multiple types of combination frequency product value files received from the first distributed processing servers to a plurality of second distributed processing servers, which are at least some of the distributed processing servers, according to the file type each is responsible for; means for storing, in the keyword combination frequency sum table storage means, the sums over all document files of the appearance frequency product values between keywords, as received from the second distributed processing servers; means for distributing the document files stored in the document storage means to a plurality of third distributed processing servers, which are at least some of the distributed processing servers; means for transmitting the keywords stored in the keyword storage means to each third distributed processing server; means for transmitting the appearance frequency square value files received from the third distributed processing servers to a fourth distributed processing server, which is one of the distributed processing servers; means for storing, in the keyword frequency sum table storage means, the sums over all document files of the squared appearance frequency of each keyword, as received from the fourth distributed processing server; means for retrieving a pair of keywords from the keyword storage means; means for retrieving, from the keyword combination frequency sum table storage means, the sum of the appearance frequency product values for that pair; means for retrieving, from the keyword frequency sum table storage means, the sum of the squared appearance frequencies of each keyword of the pair; and means for calculating the relevance between the two keywords by taking the square root of each of these sums, adding the two square roots, and dividing the sum of the product values by the result.
Each first distributed processing server comprises: means for detecting, for each assigned document file distributed by the management server, whether each keyword is present; means for generating, for each pair of keywords that appear, a keyword combination in which the keyword whose first character has the lower character code is placed first; means for calculating, for each combination, the product of the appearance frequencies of the two keywords as the appearance frequency product value of the pair; means for identifying the combination frequency product value file in which to record the value, by comparing the character code of the first character of the first keyword with the character code ranges assigned in advance to the multiple combination frequency product value files; means for recording the product value in the corresponding combination frequency product value file for each document file; and means for transmitting these multiple types of combination frequency product value files to the management server.
Each second distributed processing server comprises: means for concatenating the combination frequency product value files when multiple files of the same type are received from the management server; means for sorting the keyword combinations in the concatenated file by the character codes of the keywords; means for totaling the product values for each identical keyword combination to obtain the sum over all document files; and means for transmitting this sum to the management server.
Each third distributed processing server comprises: means for detecting, for each assigned document file distributed by the management server, whether each keyword is present; means for calculating the squared appearance frequency of each keyword that appears; means for recording the squared value in an appearance frequency square value file for each document file; and means for transmitting the appearance frequency square value file to the management server.
The fourth distributed processing server comprises: means for concatenating the appearance frequency square value files when multiple files are received from the management server; means for sorting the keywords in the concatenated file by their character codes; means for totaling the squared appearance frequencies for each identical keyword to obtain the sum over all document files; and means for transmitting this sum to the management server.
The system of claim 5 is characterized in that generation of the inter-keyword appearance frequency product value files, summation of the product values over all documents, and generation of the keyword appearance frequency square value files are distributed across multiple distributed processing servers, while summation of the squared values over all documents is handled by a single distributed processing server.
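The pair ordering and file selection performed by each first distributed processing server in claim 5 might look like the sketch below. The shard file names, the three code-point ranges in `SHARD_RANGES`, and the tie-break for keywords sharing a first character are assumptions made for illustration; the text only says that character code ranges are assigned to the files in advance.

```python
# Hypothetical layout: each combination frequency product value file is
# responsible for a half-open range of first-character code points.
SHARD_RANGES = {
    "products_0.tsv": (0x0000, 0x0080),    # ASCII
    "products_1.tsv": (0x0080, 0x3100),    # includes hiragana and katakana
    "products_2.tsv": (0x3100, 0x110000),  # everything else, including kanji
}


def ordered_pair(a, b):
    """Place the keyword whose first character has the lower code point first.

    Python compares strings code point by code point, so the first character
    decides the order; later characters only break ties (an assumption, since
    the text does not say how ties are handled).
    """
    return (a, b) if a <= b else (b, a)


def shard_for(pair):
    """Pick the product value file from the first character of the first keyword."""
    code = ord(pair[0][0])
    for name, (low, high) in SHARD_RANGES.items():
        if low <= code < high:
            return name
    raise ValueError(f"no shard covers code point {code:#x}")


pair = ordered_pair("液晶", "テレビ")
target_file = shard_for(pair)
```

Because every server applies the same ordering rule and the same ranges, all records for a given keyword pair end up in files of the same type, so a single second distributed processing server sees every record it needs to total that pair.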

The inter-keyword relevance calculation system according to claim 6 comprises a management server and a plurality of distributed processing servers.
The management server comprises: means for distributing the document files stored in the document storage means to a plurality of first distributed processing servers, which are at least some of the distributed processing servers; means for transmitting the keywords stored in the keyword storage means to each first distributed processing server; means for transmitting the combination frequency product value files received from the first distributed processing servers to a second distributed processing server, which is one of the distributed processing servers; means for storing, in the keyword combination frequency sum table storage means, the sums over all document files of the appearance frequency product values between keywords, as received from the second distributed processing server; means for distributing the document files stored in the document storage means to a plurality of third distributed processing servers, which are at least some of the distributed processing servers; means for transmitting the keywords stored in the keyword storage means to each third distributed processing server; means for routing the multiple types of appearance frequency square value files received from the third distributed processing servers to a plurality of fourth distributed processing servers, which are at least some of the distributed processing servers, according to the file type each is responsible for; means for storing, in the keyword frequency sum table storage means, the sums over all document files of the squared appearance frequency of each keyword, as received from the fourth distributed processing servers; means for retrieving a pair of keywords from the keyword storage means; means for retrieving, from the keyword combination frequency sum table storage means, the sum of the appearance frequency product values for that pair; means for retrieving, from the keyword frequency sum table storage means, the sum of the squared appearance frequencies of each keyword of the pair; and means for calculating the relevance between the two keywords by taking the square root of each of these sums, adding the two square roots, and dividing the sum of the product values by the result.
Each first distributed processing server comprises: means for detecting, for each assigned document file distributed by the management server, whether each keyword is present; means for generating, for each pair of keywords that appear, a keyword combination in which the keyword whose first character has the lower character code is placed first; means for calculating, for each combination, the product of the appearance frequencies of the two keywords as the appearance frequency product value of the pair; means for recording the product value in a combination frequency product value file for each document file; and means for transmitting the combination frequency product value file to the management server.
The second distributed processing server comprises: means for concatenating the combination frequency product value files when multiple files are received from the management server; means for sorting the keyword combinations in the concatenated file by the character codes of the keywords; means for totaling the product values for each identical keyword combination to obtain the sum over all document files; and means for transmitting this sum to the management server.
Each third distributed processing server comprises: means for detecting, for each assigned document file distributed by the management server, whether each keyword is present; means for calculating the squared appearance frequency of each keyword that appears; means for identifying the appearance frequency square value file in which to record the value, by comparing the character code of each keyword with the character code ranges assigned in advance to the multiple appearance frequency square value files; means for recording the squared value in the corresponding appearance frequency square value file for each document file; and means for transmitting these multiple types of appearance frequency square value files to the management server.
Each fourth distributed processing server comprises: means for concatenating the appearance frequency square value files when multiple files of the same type are received from the management server; means for sorting the keywords in the concatenated file by their character codes; means for totaling the squared appearance frequencies for each identical keyword to obtain the sum over all document files; and means for transmitting this sum to the management server.
The system of claim 6 is characterized in that generation of the inter-keyword appearance frequency product value files, generation of the keyword appearance frequency square value files, and summation of the squared values over all documents are distributed across multiple distributed processing servers, while summation of the product values over all documents is handled by a single distributed processing server.

The inter-keyword relevance calculation system according to claim 7 is the system of any of claims 3 to 6, in which the management server further comprises: means for distributing, in advance, the document files stored in the document storage means to a plurality of distributed processing servers, which are at least some of the distributed processing servers, and instructing them to extract keywords; and means for storing the keywords received from each distributed processing server in the keyword storage means. Each of those distributed processing servers comprises keyword extraction means for extracting keywords from the assigned document files distributed by the management server, and means for transmitting each keyword to the management server.

The inter-keyword relevance calculation method according to claim 8 is based on cooperation among a management server, a plurality of first distributed processing servers, and a second distributed processing server, and comprises the following steps.
The management server distributes the document files stored in the document storage means to the first distributed processing servers. Each first distributed processing server extracts keywords from the assigned document files sent by the management server and transmits them to the management server. The management server stores the keywords received from the first distributed processing servers in the keyword storage means and then transmits all the keywords to each first distributed processing server.
Each first distributed processing server, on receiving the keywords: detects, for each assigned document file, whether each keyword is present; calculates the squared appearance frequency of each keyword that appears and records it in an appearance frequency square value file for each document file; generates, for each pair of keywords that appear, a keyword combination in which the keyword whose first character has the lower character code is placed first; calculates, for each combination, the product of the appearance frequencies of the two keywords as the appearance frequency product value of the pair; identifies the combination frequency product value file in which to record the value, by comparing the character code of the first character of the first keyword with the character code ranges assigned in advance to the multiple combination frequency product value files; records the product value of each combination in the corresponding combination frequency product value file for each document file; and transmits the appearance frequency square value file and the multiple types of combination frequency product value files to the management server.
The management server transmits the appearance frequency square value files received from the first distributed processing servers to the second distributed processing server, and routes the multiple types of combination frequency product value files received from the first distributed processing servers to the first distributed processing server assigned to each file type.
The second distributed processing server, on receiving the appearance frequency square value files from the management server: concatenates them; sorts the keywords in the concatenated file by their character codes; totals the squared appearance frequencies for each identical keyword to obtain the sum over all document files; and transmits this sum to the management server. The management server stores the sums of squared appearance frequencies received from the second distributed processing server in the keyword frequency sum table storage means.
Each first distributed processing server, on receiving combination frequency product value files from the management server: concatenates them; sorts the keyword combinations in the concatenated file by the character codes of the keywords; totals the appearance frequency product values for each identical keyword combination to obtain the sum over all document files; and transmits this sum to the management server. The management server stores the sums of product values received from the first distributed processing servers in the keyword combination frequency sum table storage means.
The method further comprises the steps of: retrieving a pair of keywords from the keyword storage means; retrieving, from the keyword combination frequency sum table storage means, the sum of the appearance frequency product values for that pair; retrieving, from the keyword frequency sum table storage means, the sum of the squared appearance frequencies of each keyword of the pair; and calculating the relevance between the two keywords by taking the square root of each of these sums, adding the two square roots, and dividing the sum of the product values by the result.
The order of the above steps may be changed as appropriate, provided that no logical contradiction arises.
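The final calculation step can be written compactly once the two sum tables exist. In the sketch below, the dictionaries `square_sums` and `pair_product_sums` stand in for the keyword frequency sum table and the keyword combination frequency sum table; the names, the zero-denominator guard, and the sample numbers are assumptions for illustration only.

```python
import math


def relevance(pair_product_sums, square_sums, a, b):
    """Sum of products divided by the SUM of the two square roots.

    Note that the denominator adds the square roots, as the claims state,
    rather than multiplying them as cosine similarity would.
    """
    key = (a, b) if a <= b else (b, a)             # pairs are stored in sorted order
    product_sum = pair_product_sums.get(key, 0)    # keywords that never co-occur score 0
    denominator = math.sqrt(square_sums[a]) + math.sqrt(square_sums[b])
    return product_sum / denominator if denominator else 0.0


# Hypothetical totals over all document files.
square_sums = {"DVD": 25, "recorder": 9}
pair_product_sums = {("DVD", "recorder"): 12}

score = relevance(pair_product_sums, square_sums, "recorder", "DVD")
# 12 / (sqrt(25) + sqrt(9)) = 12 / 8 = 1.5
```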

According to the inter-keyword relevance calculation system of claim 1, keywords whose appearance frequency is zero, and which therefore need no relevance calculation against other keywords, are first excluded on a per-document-file basis. The squared appearance frequencies and the combination frequency product values that underlie the relevance calculation are then computed only for keywords that actually appear, and are afterwards totaled over all document files. The relevance between the large number of keywords appearing across the document files can therefore be calculated very efficiently. As a result, unlike the search system of Patent Document 1, there is no need to restrict one keyword of each combination to a company name before calculating relevance, and relevance can be calculated between keywords of any kind.
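The per-document filtering described above can be illustrated with a short sketch. It assumes that appearance frequency is a simple non-overlapping substring count (`str.count`), which the text does not specify; the function name `per_document_values` and the sample text are likewise hypothetical.

```python
from itertools import combinations


def per_document_values(text, keywords):
    """Compute squared frequencies and pair products for one document.

    Keywords with zero occurrences are dropped first, so pair products are
    only generated among keywords that actually appear in this document.
    """
    counts = {k: text.count(k) for k in keywords}
    present = sorted(k for k, c in counts.items() if c > 0)  # exclude absent keywords
    squares = {k: counts[k] ** 2 for k in present}
    products = {
        (a, b): counts[a] * counts[b]                        # pairs already in sorted order
        for a, b in combinations(present, 2)
    }
    return squares, products


squares, products = per_document_values(
    "DVD recorder and DVD player", ["DVD", "recorder", "camera"]
)
```

With n keywords in the dictionary but only m present in a document, the number of pairs generated for that document is m(m-1)/2 rather than n(n-1)/2, which is where the saving comes from.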

Moreover, even when a new document file is added to the document storage means, it suffices to perform processes (1), (2), and (4) for that new document file alone, add the results to the existing totals (sums) of (3) and (5), and then redo the calculations of (6) and (7). Recalculating relevance when a document file is added is thus simplified.
Furthermore, when the influence of an outdated document file needs to be removed, it suffices to subtract the values of (2) and (4) for that old document file from the totals (sums) of (3) and (5) and then redo the calculations of (6) and (7), which makes it easy to keep the relevance between keywords up to date.
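The add and subtract behaviour of the running totals can be sketched as follows. The class name `RunningTotals` and the use of `collections.Counter` are illustrative choices, and the per-document values are hypothetical; the only property taken from the text is that the totals are plain sums, so a document's contribution can be added or removed without revisiting any other document.

```python
from collections import Counter


class RunningTotals:
    """Totals of squared frequencies and pair products over all documents."""

    def __init__(self):
        self.square_sums = Counter()
        self.product_sums = Counter()

    def add_document(self, squares, products):
        # Add one document's per-keyword squares and per-pair products.
        self.square_sums.update(squares)
        self.product_sums.update(products)

    def remove_document(self, squares, products):
        # Subtract the values previously contributed by an outdated document.
        self.square_sums.subtract(squares)
        self.product_sums.subtract(products)


totals = RunningTotals()
old_doc = ({"DVD": 4, "recorder": 1}, {("DVD", "recorder"): 2})
new_doc = ({"DVD": 9}, {})

totals.add_document(*old_doc)
totals.add_document(*new_doc)
totals.remove_document(*old_doc)   # only new_doc's contribution remains
```

After any such update, only the square roots and the final division need to be recomputed for the affected keyword pairs.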

According to the inter-keyword relevance calculation systems of claims 2 to 7 and the inter-keyword relevance calculation method of claim 8, at least part of the following processing is distributed across multiple distributed processing servers: keyword extraction, generation of the appearance frequency square value files for each keyword, summation of the squared values over all documents, generation of the appearance frequency product value files between keywords, and summation of the product values over all documents. As a result, the overall computation involved in calculating the relevance between keywords can be made substantially faster.

FIG. 1 is a block diagram showing the functional configuration of a first inter-keyword relevance calculation system 10 according to the present invention and a first search system 11 that includes it. The configuration comprises a document DB 12, a keyword extraction unit 14, a keyword DB 16, a relevance calculation unit 18, a keyword co-occurrence frequency table DB 20, a keyword combination frequency sum table DB 22, a keyword frequency sum table DB 24, a keyword relevance table DB 26, a proper noun DB 28, and a search processing unit 30.

The keyword extraction unit 14, the relevance calculation unit 18, and the search processing unit 30 are realized by the CPU of a computer executing the necessary processing in accordance with the OS and a dedicated application program.

The document DB 12, the keyword DB 16, the keyword co-occurrence frequency table DB 20, the keyword combination frequency sum table DB 22, the keyword frequency sum table DB 24, the keyword relevance table DB 26, and the proper noun DB 28 are stored on the hard disk of the same computer.
A large number of document files (text data), such as newspaper articles, academic journals, and papers, are accumulated in the document DB 12 in advance. The proper noun DB 28 holds a large number of proper nouns, such as company names, product names, service names, and personal names, registered by category.

As shown in FIG. 2, the keyword extraction unit 14 comprises a dependency expression extraction filter 32, a delimiter extraction filter 34, a character string frequency statistics filter 36, a TermExtract filter 38, and a majority vote filter 40.

Next, the keyword extraction process performed by the keyword extraction unit 14 is described with reference to the flowchart of FIG. 3.
First, the keyword extraction unit 14 applies the dependency expression extraction filter 32 to each document file accumulated in the document DB 12 and extracts, from each document file, character strings that occur in predetermined dependency expressions (S10).
Specifically, the dependency expression extraction filter 32 is provided in advance with a large number of dependency expression patterns such as "XX manufacturer" (○○メーカー), "XX is the mainstay" (○○が主力), and "produces XX" (○○を生産). When the keyword extraction unit 14 detects an expression matching one of these patterns, it extracts the character string corresponding to "XX" as a keyword candidate.
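Pattern-based extraction of this kind can be approximated with regular expressions. The sketch below is a rough stand-in under stated assumptions: it treats a run of kanji, katakana, and ASCII alphanumerics as the candidate term, whereas an actual dependency filter would presumably rely on morphological or dependency analysis. The three patterns are the ones named in the text; `TERM`, `PATTERNS`, and `extract_candidates` are hypothetical names.

```python
import re

# Assumed definition of a candidate term: kanji, katakana (including the
# long-vowel mark), and ASCII letters or digits. Hiragana ends the term.
TERM = r"([\u4e00-\u9fff\u30a1-\u30f6\u30fcA-Za-z0-9]+)"

PATTERNS = [
    re.compile(TERM + "メーカー"),  # "XX manufacturer"
    re.compile(TERM + "が主力"),    # "XX is the mainstay"
    re.compile(TERM + "を生産"),    # "produces XX"
]


def extract_candidates(text):
    """Return the XX part of every match, pattern by pattern."""
    found = []
    for pattern in PATTERNS:
        found.extend(match.group(1) for match in pattern.finditer(text))
    return found


candidates = extract_candidates("同社は半導体メーカーで、液晶パネルを生産している。")
```

In this sample sentence the first pattern yields 半導体 and the third yields 液晶パネル; the particle は and the punctuation act as natural boundaries because they fall outside the assumed character class.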

Next, the keyword extraction unit 14 applies the delimiter extraction filter 34 to each document file and extracts, as keyword candidates, the ○○ portion of strings enclosed by delimiters such as commas, brackets, spaces, and tabs, as in 「○○」, "○○", （○○）, ［○○］, and ,○○, (S12).
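A delimiter filter for the paired delimiters listed above could be sketched with a single alternation. This example covers only corner brackets, straight double quotes, and full-width parentheses and square brackets; comma, space, and tab delimiters are omitted for brevity. The names `DELIMITED` and `extract_delimited` are assumptions for illustration.

```python
import re

# One alternative per delimiter pair; exactly one group is set per match.
DELIMITED = re.compile(
    r"「([^「」]+)」"      # corner brackets
    r'|"([^"]+)"'         # straight double quotes
    r"|（([^（）]+)）"     # full-width parentheses
    r"|［([^［］]+)］"     # full-width square brackets
)


def extract_delimited(text):
    """Return the enclosed strings in the order they appear in the text."""
    return [
        next(group for group in match.groups() if group)
        for match in DELIMITED.finditer(text)
    ]


candidates = extract_delimited('新製品「DVDレコーダー」（HDD内蔵）と"Blu-ray"を発表')
```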

Next, the keyword extraction unit 14 applies the character string frequency statistics filter 36 to each document file, counts how many times each character string in the document file appears across all documents, including the other documents, and extracts character strings whose appearance frequency falls within a fixed range as keyword candidates (S14).
As shown in FIG. 4, the character string frequency statistics filter 36 first focuses on a noun in the document (here, "DVD") and counts the number of times this focus word appears in the document files accumulated in the document DB 12. It then extends the range to the morphemes immediately before and after the focus word, counts how often each extended string appears across all documents, and stops extending the range once the appearance frequency falls to or below a fixed value (for example, 20).

For example, the string "したDVD", which includes the morpheme immediately before DVD, has an appearance frequency of only 2, so the range is not extended to any earlier morpheme. By contrast, "DVDレコーダー" ("DVD recorder"), which includes the morpheme immediately after DVD, has a high appearance frequency of 862, so the filter goes on to count "DVDレコーダーでは" ("in DVD recorders"), which includes the next morpheme as well. Because that frequency is only 5, the range is not extended to any later morpheme.

Next, the character string frequency statistical filter 36 extracts "DVD" and "DVD recorder" as keyword candidates because their appearance frequencies fall within a predetermined range (for example, 20 to 5,000). In contrast, 「したＤＶＤ」 and 「ＤＶＤレコーダーでは」 fall outside this range and are excluded from the keyword candidates.
The reasoning is that a string appearing fewer than 20 times in all documents cannot be regarded as an important term in the first place, while a string appearing more than 5,000 times is likely a generic or common word with no distinguishing character. This range is adjusted as appropriate according to the volume of document files and the intended use of the search system.
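One possible reading of the range expansion and frequency-range selection (S14) is sketched below. The function name, its parameters, and the assumption that the text is already split into morphemes are all introduced here for illustration; the description does not say whether expansion in the two directions is combined, so this sketch grows the string in each direction separately.

```python
def expand_and_select(tokens, idx, freq, stop_at=20, lo=20, hi=5000):
    """Grow the string around tokens[idx] one morpheme at a time in each direction.

    freq(s) returns the corpus-wide frequency of string s. Growth in a direction
    stops once the grown string's frequency is <= stop_at. Every examined string
    whose frequency lies in [lo, hi] is returned as a keyword candidate.
    """
    examined = {tokens[idx]: freq(tokens[idx])}
    start = idx
    while start > 0:                      # extend toward preceding morphemes
        start -= 1
        s = "".join(tokens[start:idx + 1])
        examined[s] = freq(s)
        if examined[s] <= stop_at:
            break
    end = idx
    while end < len(tokens) - 1:          # extend toward following morphemes
        end += 1
        s = "".join(tokens[idx:end + 1])
        examined[s] = freq(s)
        if examined[s] <= stop_at:
            break
    return [s for s, n in examined.items() if lo <= n <= hi]
```

With the frequencies from the example (2 for 「したDVD」, 862 for 「DVDレコーダー」, 5 for 「DVDレコーダーでは」) and any in-range count assumed for "DVD" itself, the function returns "DVD" and 「DVDレコーダー」.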

Counting the appearance frequency of every character string contained in the large number of document files stored in the document DB 12 would take an enormous amount of time. For this reason, as shown in FIG. 5, an index (a so-called inverted index) is generated in advance in the document DB 12, listing for each morpheme that appears in any document file whether or not it is present in each individual document file. By referring to this index, the keyword extraction unit 14 can obtain the appearance frequency in a relatively short time.
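A minimal sketch of such a morpheme index follows, assuming each document is already available as a list of morphemes. The names `build_inverted_index` and `docs_containing` are hypothetical, and the layout of the index in FIG. 5 may differ.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: {doc_id: list of morphemes}. Returns {morpheme: set of doc_ids}."""
    index = defaultdict(set)
    for doc_id, morphemes in docs.items():
        for m in morphemes:
            index[m].add(doc_id)
    return index

def docs_containing(index, phrase):
    """Documents that contain every morpheme of the phrase.

    This is a superset of the documents containing the phrase itself,
    so it narrows the files that still need to be scanned.
    """
    sets = [index.get(m, set()) for m in phrase]
    return set.intersection(*sets) if sets else set()
```

The lookup touches only the posting sets of the queried morphemes instead of rescanning every document file.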

Next, the keyword extraction unit 14 applies the TermExtract filter 38 to the document files stored in the document DB 12 and extracts, from each document file, character strings with a score at or above a predetermined value as keyword candidates (S16).
TermExtract is a string extraction algorithm devised to automatically extract technical terms from a domain-specific corpus (a very large body of text data consisting of digitized natural-language sentences collected mainly for research purposes). It extracts simple nouns and compound nouns from a document file as candidate terms and calculates the importance of each candidate from its appearance frequency and its connection frequency. Because TermExtract itself is a known technique, further explanation is omitted.

Next, the keyword extraction unit 14 inputs the keyword candidates extracted by the dependency expression extraction filter 32, the delimiter extraction filter 34, the character string frequency statistical filter 36, and the TermExtract filter 38 into the majority filter 40 to narrow down the keywords.
The majority filter 40 matches the keyword candidates listed by the individual filters against one another, recognizes those listed by two or more filters as final keywords, and stores them in the keyword DB 16 (S18).

Using the four filters in this way (the dependency expression extraction filter 32, the delimiter extraction filter 34, the character string frequency statistical filter 36, and the TermExtract filter 38) prevents important terms from being missed when keywords are extracted from the document files, while narrowing down with the majority filter 40 prevents unnecessary keywords (noise) from being mixed in.

Recognizing as a formal keyword a candidate selected by two or more of the four filters, as described above, is only one example; selection by three or more filters may instead be made the requirement for keyword recognition.
Likewise, the number of filters is not limited to the above, and other effective keyword candidate extraction filters may be provided in the keyword extraction unit 14.
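The majority filter 40 amounts to a vote count across the candidate lists. A minimal sketch, in which the function name `majority_filter` and the `min_votes` parameter are assumptions introduced here:

```python
from collections import Counter

def majority_filter(candidate_sets, min_votes=2):
    """Keep candidates that were listed by at least min_votes of the filters."""
    votes = Counter()
    for candidates in candidate_sets:
        votes.update(set(candidates))  # each filter votes at most once per candidate
    return {kw for kw, n in votes.items() if n >= min_votes}
```

Raising `min_votes` to 3 gives the stricter variant mentioned above, and adding a fifth filter only means passing one more candidate set.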

Next, the process by which the relevance calculation unit 18 calculates the relevance between keywords is described with reference to the flowchart of FIG. 6.
First, the relevance calculation unit 18 tallies the co-occurrence frequency of each keyword in each document file to generate a keyword co-occurrence frequency table, and stores it in the keyword co-occurrence frequency table DB 20 (S20).
FIG. 7 shows a specific example of the keyword co-occurrence frequency table stored in the keyword co-occurrence frequency table DB 20; the appearance frequency of each keyword KW-1 to KW-n is recorded for each of the documents D1 to Dn stored in the document DB 12.

Here, the relevance between given keywords X and Y can, in theory, be calculated by substituting the appearance frequencies of X and Y recorded in the keyword co-occurrence frequency table DB 20 into the terms of Equation 1 indexed by i.

The numerator of Equation 1 is the sum, over all documents, of the per-document product of the appearance frequencies of keywords X and Y, so its value grows as X and Y appear more often in the same document. However, if the absolute appearance frequencies of X and Y in a particular document are large, the numerator is pulled up accordingly, so it does not necessarily express how strongly X and Y co-occur. The denominator, in contrast, is the sum of two square roots: the square root of the sum over all documents of the squared per-document appearance frequency of X, and the corresponding square root for Y. Its value grows as the appearance frequencies of X and Y in particular documents increase. Dividing the numerator by the denominator therefore removes the effect of large absolute appearance frequencies in particular documents and yields a relevance based on the strength of co-occurrence between X and Y.
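The formula image for Equation 1 is not reproduced in this text. From the description of its numerator and denominator, and from the summary in the abstract, it can be reconstructed as follows. The symbols are chosen here for illustration: R is the relevance, x_i and y_i are the appearance frequencies of X and Y in document D_i, and n is the number of documents.

```latex
R(X, Y) = \frac{\sum_{i=1}^{n} x_i \, y_i}
               {\sqrt{\sum_{i=1}^{n} x_i^{2}} + \sqrt{\sum_{i=1}^{n} y_i^{2}}}
```

Under this reading the two square roots in the denominator are added rather than multiplied, so the measure is related to, but not the same as, cosine similarity.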

However, carrying out the calculation of Equation 1 naively requires an enormous amount of computation and a long processing time when the volume of document files and the total number of keywords are large.
In this embodiment, the calculation is therefore simplified by generating a keyword combination frequency sum table and a keyword frequency sum table from the keyword co-occurrence frequency table.

FIG. 8 illustrates the procedure. Here, the keyword co-occurrence frequency table lists the appearance frequencies of keywords KW-1 to KW-5 in document D1, but because the appearance frequencies of KW-3 and KW-4 are 0, the keyword combinations for which relevance actually needs to be calculated are only the following three:
(KW-1, KW-2), (KW-1, KW-5), (KW-2, KW-5)
Next, the relevance calculation unit 18 generates a keyword combination frequency sum table recording, for each combination, the product of the appearance frequencies, and a keyword frequency sum table recording the square of each keyword's appearance frequency, and stores them in the keyword combination frequency sum table DB 22 and the keyword frequency sum table DB 24, respectively (S22, S24).

The keyword combination frequency sum table in FIG. 8 shows only the values for document D1, but executing the same processing for every document and adding up the results gives, for each keyword combination, a value corresponding to the numerator of Equation 1.
Similarly, the keyword frequency sum table in FIG. 8 shows only the values for document D1, but accumulating the squared appearance frequency of each keyword in every document and taking the square root of each keyword's final value (sum) gives the values that make up the denominator of Equation 1.

Finally, as shown in FIG. 9, the relevance calculation unit 18 reads the sum of the combination frequencies of keywords X and Y from the keyword combination frequency sum table DB 22, reads the sum of squares for keyword X and the sum of squares for keyword Y from the keyword frequency sum table DB 24, and takes the square root of each sum of squares. It then substitutes these values into Equation 1 to calculate the relevance between keywords X and Y, and stores the result in the keyword relevance table DB 26 (S26). The relevance calculation unit 18 repeats this processing until all keyword combinations have been processed.
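Steps S22 to S26 can be sketched as follows. The function names and the dictionary-based tables are assumptions for illustration, as are the frequencies used in any example (the values in FIG. 8 are not reproduced in this text). Keywords with a frequency of 0 are skipped before pairs are formed, which is the saving the description relies on.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt

def build_tables(docs):
    """docs: iterable of {keyword: frequency} dicts, one per document."""
    pair_sum = defaultdict(int)    # (X, Y) -> sum over documents of x * y
    square_sum = defaultdict(int)  # X -> sum over documents of x * x
    for freqs in docs:
        present = sorted(k for k, f in freqs.items() if f > 0)  # drop zero counts
        for k in present:
            square_sum[k] += freqs[k] ** 2
        for x, y in combinations(present, 2):
            pair_sum[(x, y)] += freqs[x] * freqs[y]
    return pair_sum, square_sum

def relevance(x, y, pair_sum, square_sum):
    """Sum of products divided by the sum of the two root-sum-of-squares values."""
    key = (x, y) if x <= y else (y, x)
    denom = sqrt(square_sum.get(x, 0)) + sqrt(square_sum.get(y, 0))
    return pair_sum.get(key, 0) / denom if denom else 0.0
```

For a document with frequencies KW-1 = 3, KW-2 = 5, KW-3 = 0, KW-4 = 0, KW-5 = 6, only the three pairs (KW-1, KW-2), (KW-1, KW-5), and (KW-2, KW-5) are generated.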

As described above, extracting the combination patterns between keywords for each document file, obtaining the product for each combination and the square for each keyword, and then adding up the values across document files makes it possible to omit calculations involving keywords whose appearance frequency is 0.
It therefore becomes practical to calculate the relevance between all keywords, without restricting the keywords to company names as in the search system of Patent Document 1.

When a new document file is added to the document DB 12, the relevance between keywords can easily be recalculated by adding the values for each keyword in the new document file to the existing totals stored in the keyword combination frequency sum table DB 22 and the keyword frequency sum table DB 24.
Likewise, to remove the influence of a document file that has become outdated, the values for each keyword in that document file are subtracted from the existing totals stored in the keyword combination frequency sum table DB 22 and the keyword frequency sum table DB 24, which makes it easy to keep the relevance between keywords up to date.
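Because both tables hold plain sums, adding or retiring a document is a signed update. A minimal sketch, assuming the tables are held as dictionaries; the function name `apply_document` and the `sign` parameter are hypothetical:

```python
from collections import defaultdict
from itertools import combinations

def apply_document(pair_sum, square_sum, freqs, sign=1):
    """Add (sign=+1) or retire (sign=-1) one document's contribution, in place.

    freqs: {keyword: frequency} for the document being added or removed.
    """
    present = sorted(k for k, f in freqs.items() if f > 0)
    for k in present:
        square_sum[k] += sign * freqs[k] ** 2
    for x, y in combinations(present, 2):
        pair_sum[(x, y)] += sign * freqs[x] * freqs[y]
```

Only the pairs and keywords present in the affected document are touched, so the cost of an update does not depend on the size of the rest of the corpus.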

Next, the search processing procedure in the system 10 is described with reference to the flowchart of FIG. 10.
First, when the user enters a search term from the terminal device α, the search processing unit 30 accepts it (S40) and, as shown in FIG. 11, refers to the keyword relevance table DB 26 to identify the keyword that is identical to the search term or similar to it within a fixed range, and extracts a list of keywords whose relevance to that keyword is at or above a predetermined level (S42).
Next, the search processing unit 30 refers to, for example, the company name DB in the proper noun DB 28 and extracts the company names contained in the list (S44).
The extracted list of company names is transmitted to the terminal device α as a list of companies closely related to the search term (S46).
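Steps S42 to S44 can be illustrated with the sketch below. It assumes the relevance table is a dictionary keyed by keyword pairs and handles only exact matches on the search term (the similarity matching mentioned above is omitted); all names and sample values are hypothetical.

```python
def related_keywords(query, relevance_table, threshold):
    """relevance_table: {(X, Y): score}, each unordered pair stored once.

    Returns keywords paired with the query at or above the threshold,
    most relevant first.
    """
    hits = {}
    for (x, y), score in relevance_table.items():
        if score >= threshold and query in (x, y):
            other = y if x == query else x
            hits[other] = score
    return sorted(hits, key=hits.get, reverse=True)

def related_companies(query, relevance_table, company_names, threshold):
    """Keep only the related keywords that appear in the company-name list."""
    return [kw for kw in related_keywords(query, relevance_table, threshold)
            if kw in company_names]
```

Passing a set of person names instead of company names gives the person-oriented lookup without changing the code.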

As a result, the user can identify companies closely related to the entered search term (for example, a current-affairs term) and can use this as input for investment decisions.
If a person name DB is specified as the proper noun DB 28, persons closely related to the entered search term can be picked up instead.

Alternatively, the list of keywords closely related to the search term may be returned to the terminal device α as is, without matching against the company name DB or the person name DB.
If the user then designates a particular keyword in the keyword list as a search term, the search processing unit 30 extracts a further list of keywords whose relevance to that keyword is at or above the predetermined level and transmits it to the terminal device α.
As a result, the user can widen the search range in a chain from related term to related term, and can expect to arrive at unexpected keywords.

When the user designates a particular keyword in the search result list and requests presentation of the documents on which it is based, the search processing unit 30 accepts the request (S48) and, as shown in FIG. 12, searches the keyword co-occurrence frequency table DB 20 using the search term and the designated keyword to generate a list of the document numbers in which the two co-occur (S50).
Next, the search processing unit 30 searches the document DB 12 using this document number list, generates a list of document texts, and transmits it to the terminal device α (S52, S54).
As a result, the display of the terminal device α lists the number, title, abstract, date, and so on of each document in which the search term and the keyword appear together.
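Step S50 reduces to selecting the documents in which both frequencies are nonzero. A minimal sketch, assuming the co-occurrence frequency table is available as a nested dictionary (the function name and the table layout are assumptions):

```python
def cooccurring_documents(cooccurrence_table, kw_a, kw_b):
    """cooccurrence_table: {doc_id: {keyword: frequency}}.

    Returns the sorted ids of documents in which both keywords appear.
    """
    return sorted(
        doc_id
        for doc_id, freqs in cooccurrence_table.items()
        if freqs.get(kw_a, 0) > 0 and freqs.get(kw_b, 0) > 0
    )
```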

When the user selects one of these, the search processing unit 30 extracts the corresponding document file from the document DB 12 and transmits it to the terminal device α.
The user can then view the content of the document file and confirm the relationship between the search term and the keyword individually.

FIG. 13 is a conceptual diagram showing a second search system 52 equipped with a second inter-keyword relevance calculation system 50 according to the present invention. The second inter-keyword relevance calculation system 50 includes a management server 54, three first distributed processing servers 56a to 56c, and a second distributed processing server 57, and is intended to make creation of the keyword relevance table more efficient and faster through cooperation between the management server 54, the first distributed processing servers 56a to 56c, and the second distributed processing server 57.

The management server 54 includes the document DB 12, the keyword DB 16, the keyword combination frequency sum table DB 22, the keyword frequency sum table DB 24, the keyword relevance table DB 26, and the proper noun DB 28.
A Web server 58 is connected to the management server 54 via a network, and a plurality of terminal devices α are connected to the Web server 58 via the Internet 60.

The procedure for generating the keyword relevance table in the second relevance calculation system 50 is described below with reference to the flowcharts of FIGS. 14 and 15.
First, as shown in FIG. 16, the management server 54 divides the many document files 62 stored in the document DB 12 and distributes them to the first distributed processing servers 56a to 56c (S60 in FIG. 14). In doing so, the management server 54 adjusts the data volume of the document files 62a to 62c delivered to each server so that the processing load on the first distributed processing servers 56a to 56c is roughly equal.
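The description does not say how the management server 54 evens out the data volume. One common approach, shown here purely as an assumption, is a greedy split that hands the largest remaining file to the least-loaded server; the function name `split_by_size` is hypothetical.

```python
import heapq

def split_by_size(file_sizes, n_servers=3):
    """file_sizes: {file_name: size}. Returns one list of file names per server."""
    heap = [(0, i) for i in range(n_servers)]      # (current load, server index)
    assignment = [[] for _ in range(n_servers)]
    # Largest files first; ties broken by name so the result is deterministic.
    for name, size in sorted(file_sizes.items(), key=lambda kv: (-kv[1], kv[0])):
        load, i = heapq.heappop(heap)              # least-loaded server so far
        assignment[i].append(name)
        heapq.heappush(heap, (load + size, i))
    return assignment
```

The greedy split is not guaranteed to be optimal, but it keeps the loads close when no single file dominates the total.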

Next, in the first distributed processing servers 56a to 56c, the keyword extraction processing units 64a to 64c execute keyword extraction on the assigned document files 62a to 62c (S61).
As before, this keyword extraction uses the dependency expression extraction filter 32, the delimiter extraction filter 34, the character string frequency statistical filter 36, the TermExtract filter 38, and the majority filter 40 provided in each of the keyword extraction processing units 64a to 64c, so that an appropriate range of keywords free of noise is extracted.

Applying the character string frequency statistical filter 36 requires the appearance frequency of the word of interest in documents handled by the other distributed processing servers, so the first distributed processing servers 56a to 56c query the management server 54 for it.
On receiving the query, the management server 54 refers to the morpheme index (inverted index) provided in the document DB 12 to obtain the appearance frequency of the word of interest in all documents, and returns the result to the first distributed processing server 56 that issued the query.

Each of the first distributed processing servers 56a to 56c, on completing keyword extraction for its assigned document files 62a to 62c, transmits the extracted keywords to the management server 54 (S62).
The management server 54 registers the keywords received from the first distributed processing servers 56a to 56c in the keyword DB 16 (S63). If the same keyword is received more than once from the first distributed processing servers 56a to 56c, only one instance is registered in the keyword DB 16.

Next, as shown in FIG. 17, the management server 54 transmits data 66 for all keywords registered in the keyword DB 16 to the first distributed processing servers 56a to 56c (S64).

For convenience of illustration, FIG. 17 depicts the keyword data 66 being transmitted to the first distributed processing server 56b, but the same keyword data 66 is also transmitted to the other first distributed processing servers 56a and 56c.
Similarly, the description below centers on the first distributed processing server 56b, but the same processing is executed in the other first distributed processing servers 56a and 56c.

On receiving the keyword data 66 from the management server 54, the file generation unit 68b of the first distributed processing server 56b first checks, for each document file 62b assigned to it, whether each keyword actually appears, and for each keyword that does appear calculates its appearance frequency per document file (S65).

Next, the file generation unit 68b calculates the square of each keyword's appearance frequency and writes it to the appearance frequency square value file 70b (S66).
For example, if 「さくら」 (sakura, "cherry blossom") appears 3 times, 「春」 (haru, "spring") 5 times, and 「鶯」 (uguisu, "bush warbler") 6 times in a document file, the file generation unit 68b writes appearance frequency square value data, each consisting of a keyword and the square of its appearance frequency, to the appearance frequency square value file 70b one line at a time: 「さくら，９」, 「春，２５」, 「鶯，３６」.
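The per-document square value records (S66) can be sketched as below. The function name and the ASCII comma-separated line format are assumptions; the actual layout of file 70b is not specified beyond one record per line.

```python
def square_value_lines(freqs):
    """freqs: {keyword: frequency} for one document.

    Returns one 'keyword,square' line per keyword that actually appears.
    """
    return [f"{kw},{n * n}" for kw, n in freqs.items() if n > 0]
```

For the frequencies in the example above (3, 5, and 6) the squares are 9, 25, and 36.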

Next, the file generation unit 68b generates every combination of two keywords from all keywords that actually appear in the document file (S67). In each pair, the file generation unit 68b places first (on the left) the keyword whose leading character has the lower character code.

For example, assuming a document file contains the three keywords 「さくら」, 「春」, and 「鶯」, the file generation unit 68b compares the Shift JIS codes of their leading characters and generates the combinations 「さくら，春」, 「さくら，鶯」, and 「春，鶯」. For reference, the character code of 「さ」 in 「さくら」 is 82B3, that of 「春」 is 8F74, and that of 「鶯」 is E9F2.

Next, the file generation unit 68b multiplies the appearance frequencies of the two keywords in each combination and generates combination frequency product value data, each consisting of a keyword combination and its product (S68).
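Steps S67 and S68 can be sketched together. The description orders each pair by the character code of the leading character; this sketch sorts by the full Shift JIS byte string, which gives the same result whenever the leading characters differ and is an assumption otherwise. The function names are hypothetical, and keywords containing characters outside Shift JIS would raise an encoding error.

```python
from itertools import combinations

def sjis_key(keyword):
    """Sort key: the keyword's Shift JIS byte sequence."""
    return keyword.encode("shift_jis")

def pair_products(freqs):
    """freqs: {keyword: frequency} for one document.

    Returns (first, second, product) for every pair of keywords that appear,
    with the keyword of lower character code placed first.
    """
    present = sorted((k for k, n in freqs.items() if n > 0), key=sjis_key)
    return [(x, y, freqs[x] * freqs[y]) for x, y in combinations(present, 2)]
```

With frequencies of 3, 5, and 6 for 「さくら」, 「春」, and 「鶯」, the products are 15, 18, and 30.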

Next, the file generation unit 68b writes this combination frequency product value data to the combination frequency product value file that corresponds to the character code of the leading character of the first keyword (S69).
For this purpose, the file generation unit 68b creates in advance, on disk, three combination frequency product value files corresponding to the number of first distributed processing servers 56a to 56c: a first combination frequency product value file 72b, a second combination frequency product value file 74b, and a third combination frequency product value file 76b.

A character code range is allocated in advance to each of the combination frequency product value files 72b, 74b, and 76b. For example, assuming the Shift JIS code system shown in FIG. 18(a), the first combination frequency product value file 72b is assigned the character codes whose first byte is 20 to DF, as shown in FIG. 18(b). The second combination frequency product value file 74b is assigned the character codes whose first byte is 81 to 8E, and the third combination frequency product value file 76b those whose first byte is 8F to 9F or E0 to EF.

Accordingly, the combination frequency product value data 「さくら，春，１５」 and 「さくら，鶯，１８」 are written to the second combination frequency product value file 74b, because the character code of 「さ」 is 82B3.
In contrast, the combination frequency product value data 「春，鶯，３０」 is written to the third combination frequency product value file 76b, because the character code of 「春」 is 8F74.
If combination frequency product value data such as 「PCT，特許，２０」 (where 「特許」 means "patent") is generated, it is written to the first combination frequency product value file 72b, which handles half-width alphabetic characters.
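The routing rule can be sketched as a lookup on the first Shift JIS byte of the first keyword. The function name `bucket_for` is hypothetical. Because the stated range 20 to DF numerically contains 81 to 9F, this sketch checks the two double-byte ranges first and treats the first file as covering the remaining single-byte codes, which is an interpretation rather than something the text states.

```python
def bucket_for(first_keyword):
    """Return 1, 2, or 3: the product value file that receives this record."""
    lead = first_keyword.encode("shift_jis")[0]   # first byte of the leading character
    if 0x81 <= lead <= 0x8E:
        return 2
    if 0x8F <= lead <= 0x9F or 0xE0 <= lead <= 0xEF:
        return 3
    if 0x20 <= lead <= 0xDF:
        return 1                                  # single-byte codes such as ASCII letters
    raise ValueError(f"lead byte {lead:#04x} is not assigned to any file")
```

Since each file number corresponds to one first distributed processing server, this partitioning ensures that all records for a given keyword pair end up on the same server.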

The file generation unit 68b executes steps S65 to S69 for all of the document files 62b assigned to it (S70).
In the course of this processing, many items of appearance frequency square value data for the same keyword, and many items of combination frequency product value data for the same keyword combination, are expected to arise from different document files. The file generation unit 68b does not total these values at this stage; it simply appends each item, in the order generated, to the appearance frequency square value file or to the corresponding combination frequency product value file.

The above processing is executed independently in each of the first distributed processing servers 56a to 56c. When each server has finished processing its assigned document files 62a to 62c, the first distributed processing servers 56a to 56c transmit the appearance frequency square value files 70a to 70c, the first combination frequency product value files 72a to 72c, the second combination frequency product value files 74a to 74c, and the third combination frequency product value files 76a to 76c to the management server 54 (S71 in FIG. 15).

The management server 54 then redistributes the first combination frequency product value files 72a to 72c, the second combination frequency product value files 74a to 74c, and the third combination frequency product value files 76a to 76c received from the first distributed processing servers 56a to 56c, sending each set to the first distributed processing server responsible for it (S72).

For example, as shown in FIG. 19, the second combination frequency product value file is assigned in advance to the first distributed processing server 56b, so the management server 54 delivers to it the second combination frequency product value files 74a to 74c generated by the first distributed processing servers 56a to 56c.

Similarly, the first combination frequency product value files 72a to 72c generated by the first distributed processing servers 56a to 56c are delivered to the first distributed processing server 56a, to which the first combination frequency product value file is assigned, and the third combination frequency product value files 76a to 76c are delivered to the first distributed processing server 56c, to which the third combination frequency product value file is assigned.
The description below centers on the processing in the first distributed processing server 56b, but the same processing is executed in the other first distributed processing servers 56a and 56c.

First, in the first distributed processing server 56b, the file combining unit 78b concatenates the three combination frequency product value files 74a to 74c (S73).
Next, the sort processing unit 80b starts and arranges the keyword combinations (X, Y) described in the combined file 82b in the order of their character codes (S74). As a result, a sorted file 86b is generated in which multiple combination frequency product value records having the same keyword combination are lined up, such as "Sakura, Spring, 20" ... "Sakura, Spring, 32" ... "Sakura, Spring, 28".
Next, the addition processing unit 84b starts and applies so-called control break processing to the sorted file 86b, totaling the product values for each identical keyword combination (S75).
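Steps S73 to S75 can be illustrated with a minimal Python sketch. It is not the patented implementation: the in-memory list-of-tuples record layout, the function name `aggregate_products`, and the extra "Hanami" records are assumptions made for illustration. `itertools.groupby` plays the role of the control break, emitting one total each time the sorted (X, Y) key changes.

```python
from itertools import groupby
from operator import itemgetter


def aggregate_products(files):
    """Concatenate record lists, sort by (X, Y), and sum products per pair."""
    # S73: concatenate the per-server files into one combined list.
    combined = [record for f in files for record in f]
    # S74: sort by keyword pair; Python orders strings by code point
    # (an assumption standing in for the patent's "character code order").
    combined.sort(key=itemgetter(0, 1))
    # S75: control break pass, one total per run of identical (X, Y) keys.
    totals = []
    for (x, y), group in groupby(combined, key=itemgetter(0, 1)):
        totals.append((x, y, sum(value for _, _, value in group)))
    return totals


# Hypothetical contents of files 74a to 74c (the "Hanami" rows are invented).
files_74 = [
    [("Sakura", "Spring", 20), ("Sakura", "Hanami", 6)],
    [("Sakura", "Spring", 32)],
    [("Sakura", "Spring", 28), ("Sakura", "Hanami", 4)],
]
result = aggregate_products(files_74)
```

Because the input is sorted first, the totaling pass only ever holds one running group at a time, which is what makes the same approach workable on files too large for a hash table.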

As a result, the sum over all document files of the products of the appearance frequencies of keywords X and Y in the individual document files (corresponding to the numerator of Equation 1) is obtained.
This calculation result file 88b is transmitted from the first distributed processing server 56b to the management server 54 (S76).
The management server 54 then extracts the data from the calculation result file 88b and registers it in the keyword combination frequency sum table DB 22 (S77). Specifically, if a value for the same combination of keywords X and Y already exists in the keyword combination frequency sum table, the management server 54 adds the result value to the existing value; if no value exists, it newly adds keywords X and Y together with the value.
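The add-or-insert registration of S77 might look like the following sketch, in which a plain Python dictionary stands in for the keyword combination frequency sum table DB 22. The function name `register_totals` and the sample values are hypothetical.

```python
def register_totals(table, results):
    """Add each (X, Y, value) result to the table, inserting missing pairs."""
    for x, y, value in results:
        key = (x, y)
        # Existing entry: add to it. Missing entry: start from zero.
        table[key] = table.get(key, 0) + value
    return table


# Hypothetical state of the sum table before a new result file arrives.
db22 = {("Sakura", "Spring"): 80}
register_totals(db22, [("Sakura", "Spring", 15), ("Autumn", "Maple", 7)])
```

In a real relational database the same behavior would usually be expressed as a single upsert statement rather than a read followed by a write.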

In parallel with the above, predetermined processing is also executed in the second distributed processing server 57. That is, as shown in FIG. 20, the management server 54 transmits to the second distributed processing server 57 the appearance frequency square value files 70a to 70c that it received from the first distributed processing servers 56a to 56c (S78).

On receiving these, the second distributed processing server 57 uses the file combining unit 90 to concatenate the three appearance frequency square value files 70a to 70c (S79).
Next, the sort processing unit 91 starts and arranges the keywords and square values described in the combined file 92 in character code order (S80). As a result, a sorted file 93 is generated in which the same keyword appears multiple times, such as "Sakura, 16" ... "Sakura, 9" ... "Sakura, 4".
Next, the addition processing unit 94 starts and totals the square values for each identical keyword (S81).

As a result, the sum over all documents of the squares of each keyword's appearance frequency in the individual documents is obtained.
This calculation result file 95 is transmitted from the second distributed processing server 57 to the management server 54 (S82).
The management server 54 then extracts the result data from the calculation result file 95 and registers it in the keyword frequency sum table DB 24 (S83). Specifically, if a value for the same keyword already exists in the keyword frequency sum table, the management server 54 adds the result value to the existing value; if no value exists, it newly adds the keyword together with its value.

Finally, as shown in FIG. 9, the management server 54 reads the sum of the combination frequencies of keywords X and Y from the keyword combination frequency sum table DB 22, and reads the sum of the squares for keyword X and the sum of the squares for keyword Y from the keyword frequency sum table DB 24. After obtaining the square root of each sum of squares, it substitutes these values into Equation 1 to calculate the degree of relevance between keywords X and Y, and registers the result in the keyword relevance table DB 26 (S84). The management server 54 repeats this processing until all keyword combinations have been processed.
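Equation 1 is not reproduced in this passage, but the abstract and the claims describe it as the sum of the frequency products divided by the sum of the two square roots. The sketch below follows that description. The dictionaries standing in for DB 22, DB 24, and DB 26, the function names, the sample totals, and the zero-denominator guard are all assumptions added for illustration.

```python
import math


def relevance(sum_xy, sum_x2, sum_y2):
    """Sum of products divided by the sum of the two square roots."""
    denominator = math.sqrt(sum_x2) + math.sqrt(sum_y2)
    if denominator == 0:
        # Guard added for this sketch; the source does not discuss this case.
        return 0.0
    return sum_xy / denominator


def build_relevance_table(combination_totals, square_totals):
    """Compute the relevance of every pair that has a recorded product sum."""
    table = {}
    for (x, y), sum_xy in combination_totals.items():
        table[(x, y)] = relevance(sum_xy, square_totals[x], square_totals[y])
    return table


# Hypothetical totals: sqrt(100) + sqrt(900) = 40, so 80 / 40 = 2.0.
db22 = {("Sakura", "Spring"): 80}
db24 = {"Sakura": 100, "Spring": 900}
db26 = build_relevance_table(db22, db24)
```

Note that the denominator is a sum of the two norms rather than their product, so the measure differs from cosine similarity and is not bounded by 1. The sketch iterates only over pairs present in the product-sum table; a pair that never co-occurs has a numerator of zero and therefore a relevance of zero.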

In this second inter-keyword relevance calculation system 50, the keyword extraction processing, together with the keyword combination frequency sum calculation processing and the keyword frequency sum calculation processing on which the relevance calculation is premised, is executed concurrently by the first distributed processing servers 56a to 56c and the second distributed processing server 57, as described above, so the keyword relevance table can be generated far faster.

Moreover, the first distributed processing servers 56a to 56c accumulate their calculation results in file form, so a database write does not occur every time data is saved, which speeds up the processing as a whole.

The search processing procedure in the second search system 52 is described below with reference to the flowchart of FIG. 21.
First, when the user enters a search term from the terminal device α, the management server 54, which receives it via the Web server 58 (S90), refers to the keyword relevance table DB 26 as shown in FIG. 11, identifies the keyword that is identical to the search term or similar to it within a certain range, and extracts a list of keywords whose degree of relevance to that keyword is at or above a predetermined value (S91).

Next, the management server 54 refers to, for example, the company name DB within the proper noun DB 28 and extracts the company names contained in the above list (S92).
The extracted list of company names (a list of companies closely related to the search term) is transmitted to the terminal device α via the Web server 58 (S93).
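The lookup in S91 and S92 can be sketched as below, under simplifying assumptions: the relevance table is a dictionary keyed by keyword pair, the company name DB is a set, only exact matches of the search term are considered (the source also allows similar keywords within a certain range), and results are ordered by descending relevance, which the source does not specify. All names and values are hypothetical.

```python
def related_companies(search_term, relevance_table, company_names, threshold):
    """Return company names whose relevance to the term meets the threshold."""
    # S91: collect partner keywords, checking both positions in each pair.
    related = [
        (other, score)
        for (x, y), score in relevance_table.items()
        for keyword, other in ((x, y), (y, x))
        if keyword == search_term and score >= threshold
    ]
    # Ordering by relevance is an assumption of this sketch.
    related.sort(key=lambda item: item[1], reverse=True)
    # S92: keep only the keywords registered as company names.
    return [name for name, _ in related if name in company_names]


# Hypothetical relevance table and company name DB.
db26 = {
    ("ACME Corp", "Sakura"): 0.8,
    ("Sakura", "Spring"): 2.0,
    ("Sakura", "Zeta Inc"): 0.1,
    ("Beta Ltd", "Sakura"): 0.5,
}
companies = {"ACME Corp", "Beta Ltd", "Zeta Inc"}
hits = related_companies("Sakura", db26, companies, threshold=0.3)
```

Here "Spring" passes the threshold but is dropped because it is not a company name, and "Zeta Inc" is a company name but falls below the threshold.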

The allocation of functions to the distributed processing servers is not limited to the scheme of the second relevance calculation system 50 described above.
For example, in the above description the first distributed processing servers 56 handle the keyword extraction processing, the generation of the keyword appearance frequency square value files, the generation of the inter-keyword combination frequency product value files, and the calculation of the sum of the combination frequency product values over all document files, but each of these processes can instead be distributed to a separate group consisting of a plurality of other distributed processing servers.

It is also possible to provide a plurality of second distributed processing servers 57 and distribute the calculation of the sum, over all document files, of the squares of each keyword's appearance frequency. In that case, as a precondition, multiple types of keyword appearance frequency square value files must be generated in advance, one for each of as many character code ranges as there are second distributed processing servers 57; this processing can of course also be assigned to the second distributed processing servers 57.
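Routing records to files by character code range could be sketched as follows. The three range boundaries, the use of Unicode code points as the "character code", and the function names are assumptions chosen for illustration; the source does not give concrete ranges here.

```python
from bisect import bisect_right

# Hypothetical lower bounds of three character code ranges, one per server:
# file 0 from U+0000, file 1 from U+3042, file 2 from U+4E00.
RANGE_STARTS = [0x0000, 0x3042, 0x4E00]


def file_index_for(keyword):
    """Pick the file whose range contains the keyword's first character."""
    return bisect_right(RANGE_STARTS, ord(keyword[0])) - 1


def partition_squares(records):
    """Split (keyword, square) records into one list per character range."""
    files = [[] for _ in RANGE_STARTS]
    for keyword, square in records:
        files[file_index_for(keyword)].append((keyword, square))
    return files


parts = partition_squares([("さくら", 16), ("春", 9), ("API", 4)])
```

Because every occurrence of a given keyword lands in the same file, each server can total its own file independently and no cross-server merge of the sums is needed.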

FIG. 1 is a block diagram showing the functional configuration of the first inter-keyword relevance calculation system and the first search system according to the present invention.
FIG. 2 is a block diagram showing the functional configuration of the keyword extraction unit.
FIG. 3 is a flowchart showing the keyword extraction process.
FIG. 4 is an explanatory diagram showing the operation of the character string frequency statistics filter.
FIG. 5 is an explanatory diagram showing a morpheme index formed in the document DB.
FIG. 6 is a flowchart showing the inter-keyword relevance calculation process.
FIG. 7 is an explanatory diagram showing an example of the keyword co-occurrence frequency table.
FIG. 8 is an explanatory diagram showing a method of simplifying the relevance calculation processing.
FIG. 9 is an explanatory diagram showing how the keyword relevance table is generated from the keyword combination frequency sum table and the keyword frequency sum table.
FIG. 10 is a flowchart showing the search processing procedure.
FIG. 11 is an explanatory diagram showing how a company name list is extracted based on a search term.
FIG. 12 is an explanatory diagram showing how the basis for the degree of relevance between a search term and a specific keyword is presented.
FIG. 13 is a block diagram showing the functional configuration of the second inter-keyword relevance calculation system and the second search system according to the present invention.
FIG. 14 is a flowchart showing the inter-keyword relevance calculation process.
FIG. 15 is a flowchart showing the inter-keyword relevance calculation process.
FIG. 16 is a schematic diagram showing how document files are divided and delivered from the management server to the first distributed processing servers.
FIG. 17 is a schematic diagram showing how the first distributed processing servers generate the appearance frequency square value file for each keyword and the inter-keyword combination frequency product value files.
FIG. 18 is an explanatory diagram showing an example of character code ranges assigned to the combination frequency product value files.
FIG. 19 is a schematic diagram showing how the first distributed processing servers calculate the sums of the inter-keyword appearance frequency product values.
FIG. 20 is a schematic diagram showing how the second distributed processing server calculates the sum of the appearance frequency square values for each keyword.
FIG. 21 is a flowchart showing the search processing procedure.

Explanation of symbols

10 First inter-keyword relevance calculation system
11 First search system
12 Document DB
14 Keyword extraction unit
16 Keyword DB
18 Relevance calculation unit
20 Keyword co-occurrence frequency table DB
22 Keyword combination frequency sum table DB
24 Keyword frequency sum table DB
26 Keyword relevance table DB
28 Proper noun DB
30 Search processing unit
32 Dependency expression extraction filter
34 Delimiter extraction filter
36 Character string frequency statistics filter
38 TermExtract filter
40 Majority filter
50 Second inter-keyword relevance calculation system
52 Second search system
54 Management server
56a to 56c First distributed processing servers
57 Second distributed processing server
58 Web server
60 Internet
62a to 62c Assigned document files
64a to 64c Keyword extraction processing units
68b File generation unit
70a to 70c Keyword appearance frequency square value files
72a to 72c Combination frequency product value files
74a to 74c Combination frequency product value files
76a to 76c Combination frequency product value files
66 All keyword data
78b File combining unit
80b Sort processing unit
82b Combined file
84b Addition processing unit
86b Sorted file
88b Calculation result file
90 File combining unit
91 Sort processing unit
92 Combined file
93 Sorted file
94 Addition processing unit
95 Calculation result file
α Terminal device

Claims

Document storage means for storing a plurality of document files;
A keyword extracting means for extracting a plurality of keywords from each document file and storing the extracted keywords in the keyword storage means;
Relevance calculation means for calculating the degree of relevance between a pair of keywords, for every combination of keywords, based on the appearance frequency of each keyword in each document file, and storing the degree of relevance in keyword relevance storage means,
wherein the relevance calculation means executes:
(1) A process of detecting, for each document file, the keywords that appear in that document file and calculating their appearance frequencies;
(2) A process of calculating the square of the appearance frequency of each keyword;
(3) A process of totaling the squares of the appearance frequency of each keyword to calculate their sum over all document files;
(4) A process of calculating, for each document file and for each pair of keywords, the product of the appearance frequencies of the two keywords as the product value of the appearance frequencies between the pair of keywords;
(5) A process of totaling the product values of the appearance frequencies between the keywords to calculate their sum over all document files;
(6) A process of calculating the square root of the sum of (3) above;
(7) A process of calculating the degree of relevance between the two keywords by adding the square roots of (6) above for the pair of keywords and dividing the sum of (5) above by the result of that addition,
the above constituting a system for calculating the degree of relevance between keywords.

A relevance calculation system between keywords comprising a management server, a plurality of first distributed processing servers, and a second distributed processing server,
Means for distributing a plurality of document files stored in the document storage means to each first distributed processing server;
Means for storing the keyword transmitted from each first distributed processing server in the keyword storage means;
Means for transmitting all the keywords stored in the keyword storage means to the first distributed processing server,
Means for transmitting a plurality of appearance frequency square value files transmitted from each first distributed processing server to the second distributed processing server;
Means for sorting the plurality of types of combination frequency product value files transmitted from each first distributed processing server by type and distributing each to the first distributed processing server responsible for that type;
Means for storing, in the keyword frequency sum total table storage means, the sum total of all the document files of the square values of the appearance frequencies of the keywords transmitted from the second distributed processing server;
Means for storing sum totals over all document files of product values of appearance frequencies between the keywords transmitted from the respective first distributed processing servers in the keyword combination frequency sum table storage means;
Means for retrieving a pair of keywords from the keyword storage means;
Means for taking out a sum of product values of appearance frequencies between the keywords for the pair of keywords from the keyword combination frequency sum table storage means;
Means for retrieving, from the keyword frequency sum table storage means, the sum of the squares of the appearance frequency of each keyword of the pair;
A means for calculating the degree of relevance between the two keywords by calculating the square root of each sum, adding both square roots, and dividing the sum of the product values of the appearance frequencies between the keywords by this sum,
Each first distributed processing server comprises keyword extracting means for extracting keywords from the assigned document files distributed by the management server;
Means for sending each keyword to the management server;
Means for detecting, when all the keywords are transmitted from the management server, whether each keyword appears in each of the assigned document files;
Means for calculating a square value of an appearance frequency of a keyword having an appearance record, and describing each document file in an appearance frequency square value file;
Means for generating keyword combinations in which, of each pair of keywords that have appeared, the keyword whose first character comes earlier in character code order is placed first;
Means for calculating, for each document file and for each pair of keywords, the product of the appearance frequencies of the two keywords as the product value of the appearance frequencies between the pair of keywords;
Means for identifying the combination frequency product value file in which to write the product value, by comparing the character code of the first character of the first keyword with the character code ranges assigned in advance to the plurality of combination frequency product value files;
Means for describing the product value for each document file in a corresponding combination frequency product value file;
Means for transmitting the appearance frequency square value file and a plurality of types of combination frequency product value files to the management server;
Means for concatenating the combination frequency product value files when a plurality of combination frequency product value files of the same type are transmitted from the management server;
Means for sorting the keyword combinations described in the concatenated file according to the character code of each keyword;
Means for totaling the product values for each identical keyword combination to calculate their sum over all document files;
Means for transmitting this sum to the management server,
The second distributed processing server comprises means for concatenating the appearance frequency square value files when a plurality of appearance frequency square value files are transmitted from the management server;
Means for sorting the keywords described in the concatenated file according to their character codes;
Means for calculating the sum of the squares of the appearance frequency in the same keyword unit and calculating the total over all document files;
A system for calculating a degree of association between keywords, comprising means for transmitting the sum to a management server.

A relevance calculation system between keywords comprising a management server and a plurality of distributed processing servers,
Means for distributing the plurality of document files stored in the document storage means to a plurality of first distributed processing servers comprising at least a part of the plurality of distributed processing servers;
Means for transmitting each of the plurality of keywords stored in the keyword storage means to the first distributed processing server;
Means for sorting the plurality of types of combination frequency product value files transmitted from each first distributed processing server by type and distributing each to the responsible one of a plurality of second distributed processing servers comprising at least a part of the plurality of distributed processing servers;
Means for storing sum totals over all document files of product values of appearance frequencies between the keywords transmitted from each second distributed processing server in the keyword combination frequency sum table storage means;
Means for distributing a plurality of document files stored in the document storage means to a plurality of third distributed processing servers comprising at least a part of the plurality of distributed processing servers;
Means for transmitting each of the plurality of keywords stored in the keyword storage means to the third distributed processing server;
Means for sorting the plurality of types of appearance frequency square value files transmitted from each third distributed processing server by type and distributing each to the responsible one of a plurality of fourth distributed processing servers comprising at least a part of the plurality of distributed processing servers;
Means for storing, in the keyword frequency total table storage means, the sum total over all document files of the square value of the appearance frequency of each keyword transmitted from each fourth distributed processing server;
Means for retrieving a pair of keywords from the keyword storage means;
Means for taking out a sum of product values of appearance frequencies between the keywords for the pair of keywords from the keyword combination frequency sum table storage means;
Means for extracting, from the keyword frequency sum table storage means, a sum of square values of appearance frequencies of the keywords for the pair of keywords;
A means for calculating the degree of relevance between the two keywords by calculating the square root of each sum, adding both square roots, and dividing the sum of the product values of the appearance frequencies between the keywords by this sum,
Each of the first distributed processing servers comprises means for detecting, for each document file distributed by the management server, whether each keyword appears in it;
Means for generating keyword combinations in which, of each pair of keywords that have appeared, the keyword whose first character comes earlier in character code order is placed first;
Means for calculating, for each document file and for each pair of keywords, the product of the appearance frequencies of the two keywords as the product value of the appearance frequencies between the pair of keywords;
Means for identifying the combination frequency product value file in which to write the product value, by comparing the character code of the first character of the first keyword with the character code ranges assigned in advance to the plurality of combination frequency product value files;
Means for describing the product value for each document file in a corresponding combination frequency product value file;
Means for transmitting these multiple types of combination frequency product files to the management server,
Each of the second distributed processing servers comprises means for concatenating the combination frequency product value files when a plurality of combination frequency product value files of the same type are transmitted from the management server;
Means for sorting the keyword combinations described in the concatenated file according to the character code of each keyword;
Means for totaling the product values for each identical keyword combination to calculate their sum over all document files;
Means for transmitting this sum to the management server,
Each of the third distributed processing servers comprises means for detecting, for each document file distributed by the management server, whether each keyword appears in it;
Means for calculating a square value of the appearance frequency of a keyword with an appearance record;
Means for comparing the character code of each keyword with the assigned character code range of a plurality of appearance frequency square value files to which a character code range is assigned in advance, and specifying an appearance frequency square value file to be described;
Means for describing the square value for each document file in the corresponding appearance frequency square value file;
Means for transmitting these multiple types of appearance frequency square value files to the management server,
Each of the fourth distributed processing servers comprises means for concatenating the appearance frequency square value files when a plurality of appearance frequency square value files of the same type are transmitted from the management server;
Means for sorting the keywords described in the concatenated file according to their character codes;
Means for calculating the sum of the squares of the appearance frequency in the same keyword unit and calculating the total over all document files;
A system for calculating the degree of association between keywords, comprising: a means for transmitting the sum to the management server.

A relevance calculation system between keywords comprising a management server and a plurality of distributed processing servers,
Means for distributing the plurality of document files stored in the document storage means to a plurality of first distributed processing servers comprising at least a part of the plurality of distributed processing servers;
Means for transmitting each of the plurality of keywords stored in the keyword storage means to the first distributed processing server;
Means for transmitting the combination frequency product value file transmitted from each first distributed processing server to a second distributed processing server that is one of the plurality of distributed processing servers;
Means for storing sum totals over all document files of product values of appearance frequencies between keywords transmitted from the second distributed processing server in keyword combination frequency sum table storage means;
Means for distributing a plurality of document files stored in the document storage means to a plurality of third distributed processing servers comprising at least a part of the plurality of distributed processing servers;
Means for transmitting each of the plurality of keywords stored in the keyword storage means to the third distributed processing server;
Means for transmitting an appearance frequency square value file transmitted from each third distributed processing server to a fourth distributed processing server that is one of the plurality of distributed processing servers;
Means for storing, in the keyword frequency total table storage means, the sum total over all document files of the square value of the appearance frequency of each keyword transmitted from the fourth distributed processing server;
Means for retrieving a pair of keywords from the keyword storage means;
Means for taking out a sum of product values of appearance frequencies between the keywords for the pair of keywords from the keyword combination frequency sum table storage means;
Means for extracting, from the keyword frequency sum table storage means, a sum of square values of appearance frequencies of the keywords for the pair of keywords;
A means for calculating the degree of relevance between the two keywords by calculating the square root of each sum, adding both square roots, and dividing the sum of the product values of the appearance frequencies between the keywords by this sum,
Each of the first distributed processing servers comprises means for detecting, for each document file distributed by the management server, whether each keyword appears in it;
Means for generating keyword combinations in which, of each pair of keywords that have appeared, the keyword whose first character comes earlier in character code order is placed first;
Means for calculating, for each document file and for each pair of keywords, the product of the appearance frequencies of the two keywords as the product value of the appearance frequencies between the pair of keywords;
Means for describing the product value for each document file in the combination frequency product value file;
Means for transmitting this combined frequency product value file to the management server,
The second distributed processing server comprises means for concatenating the combination frequency product value files when a plurality of combination frequency product value files are transmitted from the management server;
Means for sorting the keyword combinations described in the concatenated file according to the character code of each keyword;
Means for totaling the product values for each identical keyword combination to calculate their sum over all document files;
Means for transmitting this sum to the management server,
Each of the third distributed processing servers comprises means for detecting, for each document file distributed by the management server, whether each keyword appears in it;
Means for calculating a square value of the appearance frequency of a keyword with an appearance record;
Means for describing the square value for each document file in an appearance frequency square value file;
Means for transmitting the appearance frequency square value file to the management server,
The fourth distributed processing server comprises means for concatenating the appearance frequency square value files when a plurality of appearance frequency square value files are transmitted from the management server;
Means for sorting the keywords described in the concatenated file according to their character codes;
Means for calculating the sum of the squares of the appearance frequency in the same keyword unit and calculating the total over all document files;
A system for calculating a degree of association between keywords, comprising means for transmitting the sum to a management server.

A relevance calculation system between keywords comprising a management server and a plurality of distributed processing servers,
Means for distributing the plurality of document files stored in the document storage means to a plurality of first distributed processing servers comprising at least a part of the plurality of distributed processing servers;
Means for transmitting each of the plurality of keywords stored in the keyword storage means to the first distributed processing server;
Means for sorting the plurality of types of combination frequency product value files transmitted from each first distributed processing server by type and distributing each to the responsible one of a plurality of second distributed processing servers comprising at least a part of the plurality of distributed processing servers;
Means for storing sum totals over all document files of product values of appearance frequencies between keywords transmitted from the second distributed processing server in keyword combination frequency sum table storage means;
Means for distributing a plurality of document files stored in the document storage means to a plurality of third distributed processing servers comprising at least a part of the plurality of distributed processing servers;
Means for transmitting each of the plurality of keywords stored in the keyword storage means to the third distributed processing server;
Means for transmitting an appearance frequency square value file transmitted from each third distributed processing server to a fourth distributed processing server that is one of the plurality of distributed processing servers;
Means for storing, in the keyword frequency total table storage means, the sum total over all document files of the square value of the appearance frequency of each keyword transmitted from the fourth distributed processing server;
Means for retrieving a pair of keywords from the keyword storage means;
Means for taking out a sum of product values of appearance frequencies between the keywords for the pair of keywords from the keyword combination frequency sum table storage means;
Means for extracting, from the keyword frequency total table storage means, the sum of the square values of the appearance frequencies of each keyword of the pair;
A means for calculating the degree of relevance between the two keywords by calculating the square root of each sum, adding both square roots, and dividing the sum of the product values of the appearance frequencies between the keywords by this sum,
Each of the first distributed processing servers comprises means for detecting, for each document file distributed by the management server, the presence or absence of each keyword;
Means for generating combinations of two keywords from among the keywords having an appearance record, placing first the keyword whose first character comes first in character code order;
Means for calculating, for each combination, the product of the appearance frequencies of the two keywords as the product value of the appearance frequencies between the pair of keywords;
Means for identifying the combination frequency product value file to be written to by comparing the character code of the first character of the first keyword with the character code ranges assigned in advance to a plurality of combination frequency product value files;
Means for recording the product value for each document file in the corresponding combination frequency product value file;
Means for transmitting these multiple types of combination frequency product value files to the management server,
Each of the second distributed processing servers comprises means for concatenating the combination frequency product value files when a plurality of combination frequency product value files of the same type are transmitted from the management server;
Means for sorting the keyword combinations recorded in the concatenated file according to the character code of each keyword;
Means for summing the product values for each identical keyword combination to obtain the total over all document files;
Means for transmitting this total to the management server,
Each of the third distributed processing servers comprises means for detecting, for each document file distributed by the management server, the presence or absence of each keyword;
Means for calculating the square value of the appearance frequency of each keyword having an appearance record;
Means for recording the square value for each document file in an appearance frequency square value file;
Means for transmitting the appearance frequency square value file to the management server,
The fourth distributed processing server comprises means for concatenating the appearance frequency square value files when a plurality of appearance frequency square value files are transmitted from the management server;
Means for sorting the keywords recorded in the concatenated file according to their character codes;
Means for summing the square values of the appearance frequency for each identical keyword to obtain the total over all document files;
A system for calculating a degree of association between keywords, comprising means for transmitting this total to the management server.
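
The pair generation, product calculation, and file selection recited for the first distributed processing servers can be sketched as follows for a single document. The two character code ranges, the file names in `FILE_RANGES`, and the record layout are hypothetical. The sketch orders each pair by full string comparison, which is a stricter ordering than the first-character rule in the claim and is used here only for simplicity.

```python
from itertools import combinations

# Hypothetical assignment: file name -> inclusive range of code points
# for the first character of the first keyword of a pair.
FILE_RANGES = {
    "products_a_m": (ord("a"), ord("m")),
    "products_n_z": (ord("n"), ord("z")),
}


def target_file(first_keyword):
    """Pick the file whose range covers the keyword's first character."""
    code = ord(first_keyword[0])
    for name, (low, high) in FILE_RANGES.items():
        if low <= code <= high:
            return name
    raise ValueError(f"no file assigned for {first_keyword!r}")


def product_records(frequencies):
    """Route pairwise frequency products for one document to their files.

    `frequencies` maps each keyword to its appearance count in the document.
    """
    files = {name: [] for name in FILE_RANGES}
    # Only keywords with an appearance record take part in pairs.
    present = sorted(k for k, n in frequencies.items() if n > 0)
    # combinations() over a sorted list yields each pair exactly once,
    # with the lower-ordered keyword first.
    for first, second in combinations(present, 2):
        product = frequencies[first] * frequencies[second]
        files[target_file(first)].append((first, second, product))
    return files


routed = product_records({"stock": 3, "bond": 2, "yen": 0, "rate": 1})
```

Fixing the order within each pair means a given pair is always written under the same first keyword, so every record for that pair lands in the same file and reaches the same second distributed processing server.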

A relevance calculation system between keywords comprising a management server and a plurality of distributed processing servers, wherein the management server comprises:
Means for distributing the plurality of document files stored in the document storage means to a plurality of first distributed processing servers comprising at least a part of the plurality of distributed processing servers;
Means for transmitting each of the plurality of keywords stored in the keyword storage means to the first distributed processing server;
Means for transmitting the combination frequency product value file transmitted from each first distributed processing server to a second distributed processing server that is one of the plurality of distributed processing servers;
Means for storing sum totals over all document files of product values of appearance frequencies between keywords transmitted from the second distributed processing server in keyword combination frequency sum table storage means;
Means for distributing a plurality of document files stored in the document storage means to a plurality of third distributed processing servers comprising at least a part of the plurality of distributed processing servers;
Means for transmitting each of the plurality of keywords stored in the keyword storage means to the third distributed processing server;
Means for sorting the plurality of types of appearance frequency square value files transmitted from each third distributed processing server by type and distributing them to a plurality of fourth distributed processing servers, each assigned to one type in advance, comprising at least a part of the plurality of distributed processing servers;
Means for storing, in the keyword frequency total table storage means, the sum total over all document files of the square value of the appearance frequency of each keyword transmitted from each fourth distributed processing server;
Means for retrieving a pair of keywords from the keyword storage means;
Means for taking out a sum of product values of appearance frequencies between the keywords for the pair of keywords from the keyword combination frequency sum table storage means;
Means for extracting, from the keyword frequency total table storage means, the sum of the square values of the appearance frequencies of each keyword of the pair;
A means for calculating the degree of relevance between the two keywords by calculating the square root of each sum, adding both square roots, and dividing the sum of the product values of the appearance frequencies between the keywords by this sum,
Each of the first distributed processing servers comprises means for detecting, for each document file distributed by the management server, the presence or absence of each keyword;
Means for generating combinations of two keywords from among the keywords having an appearance record, placing first the keyword whose first character comes first in character code order;
Means for calculating, for each combination, the product of the appearance frequencies of the two keywords as the product value of the appearance frequencies between the pair of keywords;
Means for recording the product value for each document file in a combination frequency product value file;
Means for transmitting this combination frequency product value file to the management server,
The second distributed processing server comprises means for concatenating the combination frequency product value files when a plurality of combination frequency product value files are transmitted from the management server;
Means for sorting the keyword combinations recorded in the concatenated file according to the character code of each keyword;
Means for summing the product values for each identical keyword combination to obtain the total over all document files;
Means for transmitting this total to the management server,
Each of the third distributed processing servers comprises means for detecting, for each document file distributed by the management server, the presence or absence of each keyword;
Means for calculating the square value of the appearance frequency of each keyword having an appearance record;
Means for identifying the appearance frequency square value file to be written to by comparing the character code of each keyword with the character code ranges assigned in advance to a plurality of appearance frequency square value files;
Means for recording the square value for each document file in the corresponding appearance frequency square value file;
Means for transmitting these multiple types of appearance frequency square value files to the management server,
Each of the fourth distributed processing servers comprises means for concatenating the appearance frequency square value files when a plurality of appearance frequency square value files of the same type are transmitted from the management server;
Means for sorting the keywords recorded in the concatenated file according to their character codes;
Means for summing the square values of the appearance frequency for each identical keyword to obtain the total over all document files;
A system for calculating a degree of association between keywords, comprising means for transmitting this total to the management server.
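
The final calculation recited for the management server (square root of each sum of square values, addition of the two roots, division of the product sum by the result) can be written as a short function. The function name `relevance` and the zero-denominator handling are assumptions made for this sketch; the claims do not address the case in which neither keyword has appeared.

```python
import math


def relevance(product_sum, square_sum_a, square_sum_b):
    """Return product_sum / (sqrt(square_sum_a) + sqrt(square_sum_b))."""
    denominator = math.sqrt(square_sum_a) + math.sqrt(square_sum_b)
    # Assumption: a pair with no recorded appearances gets relevance 0.
    return product_sum / denominator if denominator else 0.0


# Keyword A appears 3 times in one document (square sum 9) and keyword B
# appears 4 times in the same document (square sum 16), so the sum of
# products is 3 * 4 = 12 and the relevance is 12 / (3 + 4).
score = relevance(12, 9, 16)
```

Note that the denominator is the sum of the two square roots, not their product as in cosine similarity, so the resulting value is not confined to the range 0 to 1.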

The management server comprises means for distributing the plurality of document files stored in advance in the document storage means to a plurality of distributed processing servers comprising at least a part of the plurality of distributed processing servers, and instructing them to extract keywords;
Means for storing the keywords transmitted from each distributed processing server in the keyword storage means,
Each of the distributed processing servers comprises means for extracting keywords from the document files distributed to it by the management server;
The system according to claim 3, further comprising means for transmitting each keyword to the management server.

A method for calculating a degree of association between keywords based on cooperation between a management server, a plurality of first distributed processing servers, and a second distributed processing server,
The management server distributing a plurality of document files stored in the document storage means to each first distributed processing server;
Each first distributed processing server extracting a keyword from the responsible document file transmitted by the management server and transmitting the keyword to the management server;
The management server storing the keywords transmitted from each first distributed processing server in the keyword storage means, and then transmitting all the keywords to each first distributed processing server;
Each first distributed processing server, on receiving the keywords, detecting the presence or absence of each keyword in each document file it is responsible for;
Calculating the square value of the appearance frequency of each keyword having an appearance record and recording it for each document file in an appearance frequency square value file;
Generating combinations of two keywords from among the keywords having an appearance record, placing first the keyword whose first character comes first in character code order;
Calculating, for each combination, the product of the appearance frequencies of the two keywords as the product value of the appearance frequencies between the pair of keywords;
Identifying the combination frequency product value file to be written to by comparing the character code of the first character of the first keyword with the character code ranges assigned in advance to a plurality of combination frequency product value files;
Recording the product value of the appearance frequencies between the keywords of each combination, for each document file, in the corresponding combination frequency product value file;
Transmitting the appearance frequency square value file and a plurality of types of combination frequency product value files to the management server;
The management server transmitting the plurality of appearance frequency square value files transmitted from each first distributed processing server to the second distributed processing server;
Sorting the plurality of types of combination frequency product value files transmitted from each first distributed processing server by type and distributing them to the first distributed processing servers assigned in advance to each type of combination frequency product value file;
The second distributed processing server, on receiving the plurality of appearance frequency square value files from the management server, concatenating the appearance frequency square value files;
Sorting the keywords recorded in the concatenated file according to their character codes;
Summing the square values of the appearance frequency for each identical keyword to obtain the total over all document files;
Sending this total to the management server;
The management server storing the total of the square values of the appearance frequencies transmitted from the second distributed processing server in the keyword frequency total table storage means;
Each first distributed processing server, on receiving a plurality of combination frequency product value files from the management server, concatenating the combination frequency product value files;
Sorting the keyword combinations recorded in the concatenated file according to the character code of each keyword;
Summing the product values of the appearance frequencies for each identical keyword combination to obtain the total over all document files;
Sending this total to the management server;
The management server storing the totals of the product values transmitted from each first distributed processing server in the keyword combination frequency sum table storage means;
Retrieving a pair of keywords from the keyword storage means;
Extracting the sum of the product values of the appearance frequencies between the pair of keywords from the keyword combination frequency sum table storage means;
Extracting the sum of the square values of the appearance frequencies of each keyword of the pair from the keyword frequency total table storage means;
Calculating the relevance between the two keywords by calculating the square root of each sum of square values, adding the two square roots, and dividing the sum of the product values of the appearance frequencies between the keywords by the result;
A method for calculating the degree of association between keywords, comprising the above steps.
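
The steps of the method can be followed end to end in a single-process sketch that ignores the distribution across servers and the intermediate files. It assumes documents are already tokenized into lists of words; the function name `relevance_table` and the sample documents are hypothetical and serve only to trace the arithmetic.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations


def relevance_table(documents, keywords):
    """Compute the relevance of every co-occurring keyword pair.

    `documents` is a list of token lists; tokenization is assumed done.
    """
    square_sums = defaultdict(int)   # keyword -> sum of squared frequencies
    product_sums = defaultdict(int)  # (kw1, kw2) -> sum of frequency products
    wanted = set(keywords)
    for tokens in documents:
        # Appearance frequency of each keyword within this document.
        freq = Counter(t for t in tokens if t in wanted)
        for keyword, count in freq.items():
            square_sums[keyword] += count * count
        # Sorted order keeps each pair under one consistent key.
        for a, b in combinations(sorted(freq), 2):
            product_sums[(a, b)] += freq[a] * freq[b]
    return {
        (a, b): total / (math.sqrt(square_sums[a]) + math.sqrt(square_sums[b]))
        for (a, b), total in product_sums.items()
    }


docs = [
    ["bond", "bond", "rate"],
    ["bond", "stock"],
    ["rate", "rate", "stock"],
]
table = relevance_table(docs, ["bond", "rate", "stock"])
```

In this example "bond" and "rate" each have a square sum of 5 and a product sum of 2, so their relevance is 2 / (2 * sqrt(5)). In the claimed method the two accumulation loops correspond to the work split between the first and second distributed processing servers, and the final division is performed by the management server.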