JP2013077272A

JP2013077272A - Method, apparatus and computer program for obtaining keyword appearance frequency ranking

Info

Publication number: JP2013077272A
Application number: JP2011218226A
Authority: JP
Inventors: Issei Yoshida; 一星吉田; Yuta Tsuboi; 祐太坪井; Yuya Unno; 裕也海野
Original assignee: International Business Machines Corp
Current assignee: International Business Machines Corp
Priority date: 2011-09-30
Filing date: 2011-09-30
Publication date: 2013-04-25

Abstract

PROBLEM TO BE SOLVED: To provide a method, apparatus and computer program for efficiently calculating a keyword appearance frequency ranking in distributed processing with a plurality of computers. SOLUTION: Mutually non-overlapping keywords are assigned to n computers, where n is a natural number. From the keywords assigned to each computer, the keywords of the highest frequency ranks are acquired. From all the acquired keywords, the k keywords of the highest overall frequency ranks are selected, where k is a natural number. In doing this, the probability P(n, k, t) that the keywords of the highest frequency ranks in each computer yield the k highest overall frequency ranks is calculated, where t is a natural number satisfying t < k and represents the number of keywords acquired from each computer as its keywords of the highest frequency ranks. Then t is estimated on the basis of the calculated probability P(n, k, t), the t keywords of the highest frequency ranks are acquired from each computer, and the k keywords of the highest overall frequency ranks are selected from all the acquired keywords.

Description

The present invention relates to a method, an apparatus, and a computer program for efficiently obtaining the frequency ranking of keywords when the counting of keyword appearance frequencies is distributed across a plurality of computers.

In a system that, as in so-called text mining, obtains a list of the k most frequent items (k is a natural number; keywords in the case of text mining) and their appearance frequencies from a large amount of data (document data in the case of text mining), the computational load becomes excessive. In interactive text analysis in particular, the process of obtaining the top k keywords is executed repeatedly: the next search keyword is selected from the top k keywords, and a new top-k keyword list is then obtained for the results of the search with the selected keyword. How to shorten the response time is therefore an important issue.

When the number of documents is enormous, the processing is generally divided among a plurality of computers and performed in parallel. For example, the document set is divided into small subsets, each subset is placed and stored on one of the computers, and each computer obtains its top k keywords in order of frequency. The top k keywords overall are then obtained by aggregating the appearance frequencies of the keywords obtained by all the computers. That is, when the number of computers is n (n is a natural number), the top k keywords are chosen from k × n candidates.

With the method described above, however, there is no guarantee that the k keywords obtained are the correct top k overall. Therefore, instead of dividing the document set among the computers, information on the keywords whose appearance frequencies are to be obtained is divided among the computers and stored there, as in Patent Document 1, for example. Each computer can then obtain the top k keywords among only the keywords it stores, and aggregating these results yields the correct top k keywords overall.
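The difference between the two partitioning schemes can be illustrated with a small sketch. The shard contents, the counts, and the helper name `merge_top_k` are hypothetical, chosen only to produce a case in which document partitioning misses the true top keyword while keyword partitioning does not.

```python
from collections import Counter

def merge_top_k(local_counts, k):
    """Each computer reports only its local top k; the reports are summed and re-ranked."""
    merged = Counter()
    for counts in local_counts:
        for keyword, freq in counts.most_common(k):
            merged[keyword] += freq
    return merged.most_common(k)

# Document partitioning: the same keyword may be counted on several computers.
doc_shards = [Counter({"a": 5, "b": 4}), Counter({"c": 5, "b": 4})]
# Keyword partitioning: the full count of each keyword lives on exactly one computer.
kw_shards = [Counter({"a": 5, "c": 5}), Counter({"b": 8})]

true_top = sum(doc_shards, Counter()).most_common(1)   # exact answer over all documents
by_docs = merge_top_k(doc_shards, 1)                   # "b" is never reported by either shard
by_keywords = merge_top_k(kw_shards, 1)                # "b" is reported with its full count
```

With document partitioning, keyword "b" is second on both shards and is dropped before aggregation, even though its combined count is the highest.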

Patent Document 1: Japanese Patent No. 4172801

However, when the processing is distributed over n computers (n is a natural number), appearance frequencies must be obtained for n × k keywords in order to calculate the top k keywords. For example, when n = 32 and k = 100, only 100 keywords are ultimately wanted, yet appearance frequencies must also be obtained for 32 × 100 − 100 = 3100 unneeded keywords, which is inefficient.
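The arithmetic above can be checked with a short sketch. The function name `unneeded_keywords` and the value t = 20 are hypothetical; t stands for a per-computer count smaller than k, as introduced later in the description.

```python
def unneeded_keywords(n, k, t=None):
    """Keywords ranked across n computers beyond the k that are finally wanted.

    Each computer ranks t keywords (t defaults to k, the conventional case).
    """
    per_computer = k if t is None else t
    return n * per_computer - k

conventional = unneeded_keywords(32, 100)      # every computer ranks k keywords
reduced = unneeded_keywords(32, 100, t=20)     # every computer ranks only t keywords
```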

The present invention has been made in view of these circumstances, and its object is to provide a method, an apparatus, and a computer program for efficiently calculating the frequency ranking of keywords when the processing is distributed over a plurality of computers.

To achieve the above object, the method according to a first invention is a method executable by an apparatus that obtains top-ranked keywords by frequency using a plurality of computers, the method comprising the steps, performed by the apparatus, of: assigning mutually non-overlapping keywords to n computers (n is a natural number); calculating a probability P(n, k, t) that, when the top t keywords in order of frequency (t is a natural number, t < k) are acquired from the keywords assigned to each computer and the top k keywords overall (k is a natural number) are selected from all the acquired keywords, the selected k keywords are the true top k; and estimating, on the basis of the calculated probability P(n, k, t), the number t (t is a natural number, t < k) of keywords to be acquired from each computer as its top-ranked keywords, wherein the top t keywords are acquired from each computer and the top k keywords overall are selected from all the acquired keywords.

In the method according to a second invention, in the first invention, the probability P(n, k, t) is the probability that, when the top k keywords in order of frequency are selected from n × t keywords, at most (t − 1) keywords are selected from any one computer in the combination of the numbers of keywords selected from the respective computers.

In the method according to a third invention, in the first invention, the probability P(n, k, t) is the probability that, when the top k keywords in order of frequency are selected from n × t keywords with their frequency order taken into account, at most (t − 1) keywords are selected from any one computer in the combination of the numbers of keywords selected from the respective computers.

In the method according to a fourth invention, in the first invention, the probability P(n, k, t) is the probability that, when a combination for selecting the top k keywords in order of frequency from the n computers is chosen, at most t keywords are selected from any one computer in the chosen combination.

Next, to achieve the above object, the method according to a fifth invention is a method executable by an apparatus that obtains top-ranked keywords by frequency using a plurality of computers, the method comprising the steps, performed by the apparatus, of: assigning mutually non-overlapping keywords to n computers (n is a natural number); acquiring the top t keywords in order of frequency (t is a natural number) from each computer; and determining, when top-ranked keywords are acquired from the keywords assigned to each computer and the top k keywords overall (k is a natural number, t < k) are selected from all the acquired keywords, whether the number of keywords taken from each computer as top-ranked keywords is (t − 1) or fewer and, for each client, the lowest frequency among its selected keywords differs from the frequency of that client's t-th keyword, wherein, when both conditions are determined to hold, the top k keywords overall are selected from the top (t − 1) keywords of each computer among all the keywords acquired from the computers.
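The determination step of the fifth invention can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name, the data layout (one descending list of (keyword, frequency) pairs per client), the tie-breaking rule, and the sample frequencies are all hypothetical.

```python
def select_top_k_with_check(per_client, k, t):
    """Select the overall top k from each client's top t and report whether the check passes.

    per_client: one list per client of (keyword, frequency) pairs, sorted by
    descending frequency and holding at most t entries.
    """
    pool = [(freq, kw, cid)
            for cid, entries in enumerate(per_client)
            for kw, freq in entries]
    # Highest frequency first; ties are broken by keyword (an assumption for determinism).
    pool.sort(key=lambda item: (-item[0], item[1]))
    selected = pool[:k]

    verified = len(selected) == k
    for cid, entries in enumerate(per_client):
        chosen = [freq for freq, _, c in selected if c == cid]
        if len(chosen) > t - 1:
            # More than (t - 1) keywords came from this client.
            verified = False
        elif chosen and len(entries) >= t and min(chosen) == entries[t - 1][1]:
            # The lowest selected frequency equals the frequency of the t-th keyword.
            verified = False
    return [(kw, freq) for freq, kw, _ in selected], verified

# Hypothetical frequencies for three clients, with t = 3 and k = 4.
top_t = [[("A", 90), ("B", 40), ("C", 30)],
         [("E", 100), ("F", 60), ("G", 20)],
         [("P", 80), ("Q", 50), ("R", 10)]]
result, ok = select_top_k_with_check(top_t, k=4, t=3)

# With t = 2, two keywords come from the second client, so the check fails.
_, ok_small_t = select_top_k_with_check([entries[:2] for entries in top_t], k=4, t=2)
```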

Next, to achieve the above object, the apparatus according to a sixth invention is an apparatus that obtains top-ranked keywords by frequency using a plurality of computers, comprising: a keyword assignment unit that assigns mutually non-overlapping keywords to n computers (n is a natural number); a probability calculation unit that calculates a probability P(n, k, t) that, when top-ranked keywords are acquired from the keywords assigned to each computer and the top k keywords overall (k is a natural number) are selected from all the acquired keywords, the top-ranked keywords of the computers constitute the top k overall; and an estimation unit that estimates, on the basis of the calculated probability P(n, k, t), the number t (t is a natural number, t < k) of keywords to be acquired from each computer as its top-ranked keywords, wherein the top t keywords are acquired from each computer and the top k keywords overall are selected from all the acquired keywords.

In the apparatus according to a seventh invention, in the sixth invention, the probability P(n, k, t) is the probability that, when the top k keywords in order of frequency are selected from n × t keywords, at most (t − 1) keywords are selected from any one computer in the combination of the numbers of keywords selected from the respective computers.

In the apparatus according to an eighth invention, in the sixth invention, the probability P(n, k, t) is the probability that, when the top k keywords in order of frequency are selected from n × t keywords with their frequency order taken into account, at most (t − 1) keywords are selected from any one computer in the combination of the numbers of keywords selected from the respective computers.

In the apparatus according to a ninth invention, in the sixth invention, the probability P(n, k, t) is the probability that, when a combination for selecting the top k keywords in order of frequency from the n computers is chosen, at most t keywords are selected from any one computer in the chosen combination.

Next, to achieve the above object, the apparatus according to a tenth invention is an apparatus that obtains top-ranked keywords by frequency using a plurality of computers, comprising: a keyword assignment unit that assigns mutually non-overlapping keywords to n computers (n is a natural number); a keyword acquisition unit that acquires the top t keywords in order of frequency (t is a natural number) from each computer; and a determination unit that determines, when top-ranked keywords are acquired from the keywords assigned to each computer and the top k keywords overall (k is a natural number, t < k) are selected from all the acquired keywords, whether the number of keywords taken from each computer as top-ranked keywords is (t − 1) or fewer and, for each client, the lowest frequency among its selected keywords differs from the frequency of that client's t-th keyword, wherein, when both conditions are determined to hold, the top k keywords overall are selected from the top (t − 1) keywords of each computer among all the keywords acquired from the computers.

Next, to achieve the above object, the computer program according to an eleventh invention is a computer program executable by an apparatus that obtains top-ranked keywords by frequency using a plurality of computers, the program causing the apparatus to function as: keyword assignment means for assigning mutually non-overlapping keywords to n computers (n is a natural number); probability calculation means for calculating a probability P(n, k, t) that, when top-ranked keywords are acquired from the keywords assigned to each computer and the top k keywords overall (k is a natural number) are selected from all the acquired keywords, the top-ranked keywords of the computers constitute the top k overall; estimation means for estimating, on the basis of the calculated probability P(n, k, t), the number t (t is a natural number, t < k) of keywords to be acquired from each computer as its top-ranked keywords; and means for acquiring the top t keywords from each computer and selecting the top k keywords overall from all the acquired keywords.

In the computer program according to a twelfth invention, in the eleventh invention, the probability P(n, k, t) is the probability that, when the top k keywords in order of frequency are selected from n × t keywords, at most (t − 1) keywords are selected from any one computer in the combination of the numbers of keywords selected from the respective computers.

In the computer program according to a thirteenth invention, in the eleventh invention, the probability P(n, k, t) is the probability that, when the top k keywords in order of frequency are selected from n × t keywords with their frequency order taken into account, at most (t − 1) keywords are selected from any one computer in the combination of the numbers of keywords selected from the respective computers.

In the computer program according to a fourteenth invention, in the eleventh invention, the probability P(n, k, t) is the probability that, when a combination for selecting the top k keywords in order of frequency from the n computers is chosen, at most t keywords are selected from any one computer in the chosen combination.

Next, to achieve the above object, the computer program according to a fifteenth invention is a computer program executable by an apparatus that obtains top-ranked keywords by frequency using a plurality of computers, the program causing the apparatus to function as: keyword assignment means for assigning mutually non-overlapping keywords to n computers (n is a natural number); keyword acquisition means for acquiring the top t keywords in order of frequency (t is a natural number) from each computer; determination means for determining, when top-ranked keywords are acquired from the keywords assigned to each computer and the top k keywords overall (k is a natural number, t < k) are selected from all the acquired keywords, whether the number of keywords taken from each computer as top-ranked keywords is (t − 1) or fewer and, for each client, the lowest frequency among its selected keywords differs from the frequency of that client's t-th keyword; and means for selecting, when both conditions are determined to hold, the top k keywords overall from the top (t − 1) keywords of each computer among all the keywords acquired from the computers.

According to the present invention, obtaining the top k keywords in order of frequency does not require every one of the computers to obtain its own top k keywords; each computer need only obtain t keywords, where t is smaller than k. This reduces the computational load on each computer, and because the number of top-ranked keywords each computer must obtain in order to yield the top k overall is narrowed down, the response time for search results can be greatly shortened.

A block diagram schematically showing the configuration of a frequency-order keyword extraction system according to an embodiment of the present invention.

A functional block diagram of an information processing apparatus according to the embodiment of the present invention.

An illustration of keyword assignment in the frequency-order keyword extraction system according to the embodiment of the present invention.

A schematic diagram for explaining the method of calculating the probability P(n, k, t) in the frequency-order keyword extraction system according to the embodiment of the present invention.

A flowchart showing the processing procedure of the CPU of the information processing apparatus according to the embodiment of the present invention.

A functional block diagram of the information processing apparatus according to the embodiment of the present invention for the case where the probability is not calculated.

A flowchart showing the processing procedure of the CPU of the information processing apparatus according to the embodiment of the present invention for the case where the probability is not calculated.

An illustration of a screen that accepts the selection count k and the threshold probability p.

A table showing example values of t calculated by each probability calculation method.

A table showing the execution time (in seconds) of the process of selecting the top k keywords in order of frequency when n = 32 and p = 90%.

An apparatus for obtaining top-ranked keywords by frequency according to an embodiment of the present invention is described in detail below with reference to the drawings. The following embodiment does not limit the invention set forth in the claims, and it goes without saying that not all combinations of the features described in the embodiment are essential to the solution.

The present invention can be carried out in many different modes and should not be construed as limited to the description of the embodiment. The same elements are denoted by the same reference numerals throughout the embodiment.

The following embodiment describes an apparatus in which a computer program has been installed on a computer system. As will be apparent to those skilled in the art, however, the present invention can be implemented in part as a computer program executable by a computer. The present invention can therefore take the form of a hardware embodiment, namely an apparatus that calculates the frequency ranking of keywords, a software embodiment, or an embodiment combining software and hardware. The computer program can be recorded on any computer-readable recording medium such as a hard disk, DVD, CD, optical storage device, or magnetic storage device.

According to the embodiment of the present invention, obtaining the top k keywords in order of frequency does not require every one of the computers to obtain its own top k keywords; each computer need only obtain t keywords, where t is smaller than k. This reduces the computational load on each computer, and because the number of top-ranked keywords each computer must obtain in order to yield the top k overall is narrowed down, the response time for search results can be greatly shortened.

FIG. 1 is a block diagram schematically showing the configuration of the frequency-order keyword extraction system according to the embodiment of the present invention. The system comprises an information processing apparatus 1 connected to a plurality of clients (computers) 2 via a network 3 so as to be capable of data communication. The information processing apparatus 1 comprises at least a CPU (central processing unit) 11, a memory 12, a storage device 13, an I/O interface 14, a video interface 15, a portable disk drive 16, a communication interface 17, and an internal bus 18 connecting the hardware listed above.

The CPU 11 is connected via the internal bus 18 to the hardware units of the information processing apparatus 1 described above. It controls the operation of those units and executes various software functions in accordance with the computer program 100 stored in the storage device 13. The memory 12 is a volatile memory such as SRAM or SDRAM; the load module is expanded into it when the computer program 100 is executed, and it stores temporary data and the like generated during execution of the computer program 100.

The storage device 13 comprises a built-in fixed storage device (hard disk), a ROM, and the like. The computer program 100 stored in the storage device 13 is downloaded by the portable disk drive 16 from a portable recording medium 90, such as a DVD or CD-ROM, on which information such as programs and data is recorded; at run time it is expanded from the storage device 13 into the memory 12 and executed. It may of course instead be a computer program downloaded from an external computer connected via the communication interface 17.

The communication interface 17 is connected to the internal bus 18 and, by being connected to an external network such as the Internet, a LAN, or a WAN, can exchange data with external computers and the like.

The I/O interface 14 is connected to input devices such as a keyboard 21 and a mouse 22 and accepts data input. The video interface 15 is connected to a display device 23 such as a CRT display or a liquid crystal display and displays predetermined images.

FIG. 2 is a functional block diagram of the information processing apparatus 1 according to the embodiment of the present invention. In FIG. 2, the keyword assignment unit 201 of the information processing apparatus 1 assigns mutually non-overlapping keywords to n clients 2 (n is a natural number).

FIG. 3 is an illustration of keyword assignment in the frequency-order keyword extraction system according to the embodiment of the present invention. FIG. 3 shows keywords assigned to n clients 2 (n = 3) in order to calculate the top k keywords in order of frequency (k = 4). Specifically, keywords A, B, C, ... are assigned to client S1; keywords E, F, G, ..., which do not overlap with those of client S1, are assigned to client S2; and keywords P, Q, R, ..., which do not overlap with those of clients S1 and S2, are assigned to client S3.

Each client 2 obtains the appearance frequency of each keyword assigned to it and lists the keywords in order of frequency. The information processing apparatus 1 then acquires the keywords and appearance frequencies listed by all the clients 2 and obtains the top four keywords overall (k = 4) in order of frequency; in the example of FIG. 3 these are keywords E, A, P, and F.
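The aggregation described for FIG. 3 can be sketched as below. The frequencies are hypothetical (the text gives only the keyword names and the resulting order E, A, P, F), and the names `clients`, `local_ranking`, and `global_top_k` are illustrative.

```python
import heapq

# Hypothetical appearance frequencies; each keyword belongs to exactly one client.
clients = {
    "S1": {"A": 90, "B": 40, "C": 30},
    "S2": {"E": 100, "F": 60, "G": 20},
    "S3": {"P": 80, "Q": 50, "R": 10},
}

def local_ranking(counts):
    """A client lists its assigned keywords in descending order of frequency."""
    return sorted(counts.items(), key=lambda kv: -kv[1])

def global_top_k(clients, k):
    """The aggregating apparatus pools every client's list and keeps the k most frequent."""
    pooled = [pair for counts in clients.values() for pair in local_ranking(counts)]
    return heapq.nlargest(k, pooled, key=lambda kv: kv[1])

top4 = [keyword for keyword, _ in global_top_k(clients, 4)]
```

Because the keyword sets are disjoint, no frequencies need to be summed across clients; pooling and re-ranking is sufficient.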

Returning to FIG. 2, the probability calculation unit 202 calculates the probability P(n, k, t) that, when top-ranked keywords are acquired from the keywords assigned to each client 2 and the top k keywords overall (k is a natural number) are selected from all the acquired keywords, the top-ranked keywords of the clients 2 constitute the top k overall. The method of calculating the probability P(n, k, t) is not particularly limited.

For example, the probability P(n, k, t) may be obtained as the probability that, when the top k keywords in order of frequency are selected from n × t keywords, at most (t − 1) keywords are selected from any one computer in the combination of the numbers of keywords selected from the respective computers (probability calculation method (1)). Letting f(n, k, t) denote the number of combinations for choosing k keywords from the n × t keywords obtained by pooling the t keywords acquired at each client 2, the probability P(n, k, t) is defined as shown in (Expression 1).

In (Equation 1), the number of combinations f(n, k, t) is given by (Equation 2). Accordingly, the probability P(n, k, t) can be calculated directly by substituting values for n, k, and t into (Equation 2).

The computational cost of evaluating (Equation 2) is 0.5 × n × (k/t)^2. Alternatively, f(n, k, t) can be calculated by dynamic programming; (Equation 3) is the recurrence used in that case.

The computational cost of evaluating (Equation 3) is n × k. Therefore, when t is 5% of k, for example, (k/t)^2 is 400, so the cost of (Equation 2) is 0.5 × n × 400 = 200 × n, and (Equation 2) requires less computation than (Equation 3) whenever k > 200.
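As an illustration of the two ways of counting, the sketch below computes the number of ways in which n computers can contribute k keywords in total when each contributes between 0 and t. This is an assumed reading of the verbal definition of f(n, k, t): the equations themselves are not reproduced in this text, so neither the recurrence nor the inclusion-exclusion form below should be taken as the patent's (Equation 2) or (Equation 3). The function names are hypothetical.

```python
from math import comb


def count_bounded_dp(n, k, t):
    """Count integer tuples (x_1..x_n) with 0 <= x_i <= t and x_1 + ... + x_n = k.

    Dynamic programming with prefix sums, about n * k additions.
    """
    # row[j] = number of ways for the computers processed so far to supply j keywords
    row = [1] + [0] * k
    for _ in range(n):
        prefix = [0]
        for value in row:
            prefix.append(prefix[-1] + value)
        # new row[j] = sum of old row[j - c] for c = 0..min(t, j)
        row = [prefix[j + 1] - prefix[max(0, j - t)] for j in range(k + 1)]
    return row[k]


def count_bounded_closed_form(n, k, t):
    """Same count by inclusion-exclusion over computers that exceed t (assumes n >= 1)."""
    total = 0
    for j in range(min(n, k // (t + 1)) + 1):
        total += (-1) ** j * comb(n, j) * comb(k - j * (t + 1) + n - 1, n - 1)
    return total


print(count_bounded_dp(3, 4, 2), count_bounded_closed_form(3, 4, 2))
```

The closed form has roughly k/t terms, while the table-based version touches n × k cells, which mirrors the trade-off between the two costs discussed above.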

Alternatively, the probability P(n, k, t) may be obtained as the probability that, when the top k keywords are selected from n × t keywords with their frequency order taken into account, at most (t - 1) keywords are selected from any computer, considering the combinations of the numbers of keywords selected per computer (probability calculation method (2)). Because the frequency order of the top k keywords is taken into account here, P(n, k, t) can be obtained from the number of ways of assigning k distinguishable keywords. That is, let g(n, k, t) be the number of ways of sequentially assigning the correct top k keywords to the n clients 2 such that each client 2 is assigned at most t keywords; the probability P(n, k, t) is then defined as shown in (Equation 4).

In (Equation 4), g(n, k, t) can be calculated by dynamic programming; (Equation 5) is the recurrence used in that case.

The probability P(n, k, t) may also be obtained as the probability that, when a way of selecting the top k keywords in frequency order from the n computers is chosen, at most t keywords are selected from any computer in the chosen combination (probability calculation method (3)). Using the number of combinations g(n, k, t) in which each client 2 is assigned at most t keywords when the correct top k keywords are assigned to the n clients 2, the probability P(n, k, t) is defined as shown in (Equation 6). Here g(n, k, ∞) is the number of combinations when t is unrestricted and is therefore equal to n^k.

The recurrence for calculating g(n, k, t) in (Equation 6) by dynamic programming is the same as (Equation 5).
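The following sketch counts the assignments of k distinguishable keywords to n clients with at most t keywords per client, and divides by n^k, the unrestricted count stated above. Both the recurrence and the ratio are assumptions derived from the verbal definitions, since (Equation 5) and (Equation 6) are not reproduced in this text; the function names are hypothetical.

```python
from fractions import Fraction
from math import comb


def count_assignments(n, k, t):
    """Ways to assign k distinguishable keywords to n clients, at most t per client."""
    # ways[j] = ways for the clients processed so far to hold a given set of j keywords
    ways = [1] + [0] * k
    for _ in range(n):
        # The next client takes c of the j keywords; the rest stay with earlier clients.
        ways = [
            sum(comb(j, c) * ways[j - c] for c in range(min(t, j) + 1))
            for j in range(k + 1)
        ]
    return ways[k]


def prob_at_most_t(n, k, t):
    """Assumed ratio g(n, k, t) / n**k, as an exact fraction."""
    return Fraction(count_assignments(n, k, t), n ** k)


print(prob_at_most_t(2, 2, 1))  # two keywords, two clients, at most one each
```

Exact integer arithmetic is used so that the ratio does not lose precision for larger n and k; converting the result with `float()` is enough when only a comparison against a threshold is needed.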

FIG. 4 is a schematic diagram for explaining how the probability P(n, k, t) is calculated in the frequency-order keyword extraction system according to the embodiment of the present invention. FIG. 4(a) shows the n × t keywords obtained by acquiring the top t keywords in frequency order from each of the n clients S1, S2, ..., Sn.

FIG. 4(b) shows a state in which the top k keywords in frequency order have been selected from the n × t keywords shown in FIG. 4(a). The hatched portion of FIG. 4(b) indicates the selected keywords. In the example of FIG. 4(b), client S2 has the most keywords selected, namely t. In this case, the overall top k keywords cannot be selected correctly unless the top t keywords are acquired from every client.

FIG. 4(c) likewise shows a state in which the top k keywords in frequency order have been selected from the n × t keywords shown in FIG. 4(a), with the hatched portion indicating the selected keywords. In the example of FIG. 4(c), client Sn has the most keywords selected, namely (t - 1), and no client has t keywords selected. In this case, the overall top k keywords can be selected correctly.

In probability calculation method (1), the hatched cells are counted without distinguishing the frequency order of the keywords. That is, as long as a keyword is selected from a given client Si, its rank in the frequency order makes no difference. In probability calculation methods (2) and (3), by contrast, the selection distinguishes the frequency order of the keywords.

When keywords are selected without distinguishing their frequency order, the case where keywords are selected evenly from all clients and the case where the number of selected keywords is skewed across clients are given the same weight. When k keywords are assigned to clients at random, however, an even selection is the more likely outcome. Probability calculation method (1), which ignores the frequency order, therefore yields a larger estimate of t.

Returning to FIG. 2, the keyword acquisition number estimation unit (estimation means) 203 estimates, based on the calculated probability P(n, k, t), the number t of keywords (t is a natural number, t < k) to be acquired from each client 2 as its highest-frequency keywords. Specifically, it estimates a value of t for which the probability P(n, k, t) calculated by the probability calculation unit 202 is equal to or greater than a predetermined threshold probability p.
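One way to carry out this estimation is a linear scan over t. The sketch below assumes that the smallest qualifying t is wanted, which the text does not state explicitly, and takes the probability function as a parameter so that any of the calculation methods can be plugged in. `smallest_t` and `toy_probability` are hypothetical names, and the toy function is a stand-in rather than one of methods (1) to (3).

```python
def smallest_t(prob, n, k, p):
    """Return the smallest t in 1..k-1 with prob(n, k, t) >= p, or None if there is none."""
    for t in range(1, k):
        if prob(n, k, t) >= p:
            return t
    return None


def toy_probability(n, k, t):
    """Placeholder model that increases with t; NOT one of the patent's methods."""
    return 1 - 0.5 ** t


print(smallest_t(toy_probability, 32, 100, 0.9))
```

A smaller t means less work per client, so stopping at the first t that reaches the threshold keeps the per-client load as low as the chosen probability allows.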

The keyword acquisition unit 204 acquires the top t keywords in frequency order from each client 2. The keyword selection unit 205 selects the top k keywords (k is a natural number) in frequency order from all the keywords acquired from the n clients 2.

The result notification unit 206 notifies the user of information on whether the selected keywords are accurate. The method of notifying the user is not particularly limited: the information may be displayed on the display device 23, or a fixed message may be sent by e-mail to a predetermined address.

Specifically, when the selected top k keywords are all selected from the top (t - 1) keywords of each client 2 and, for each client 2, the lowest frequency among its selected keywords differs from the frequency of that client's t-th keyword, the result notification unit 206 reports that the selection result is accurate. Otherwise, that is, when the selection for some client 2 extends to its top t keywords, the result notification unit 206 reports that the selection result is accurate for those keywords whose frequency is greater than the highest frequency among the keywords ranked t-th in frequency order.

FIG. 5 is a flowchart showing the processing procedure of the CPU 11 of the information processing apparatus 1 according to the embodiment of the present invention. In FIG. 5, the CPU 11 of the information processing apparatus 1 assigns mutually non-overlapping keywords to n clients 2 (n is a natural number) with which it is connected for data communication (step S501).

The CPU 11 calculates the probability P(n, k, t) that, when the highest-frequency keywords are acquired from the keywords assigned to each client 2 and the overall top k keywords (k is a natural number) are selected from all the acquired keywords, the highest-frequency keywords of each client 2 yield the overall top k (step S502). Any of probability calculation methods (1) to (3) described above is used to calculate P(n, k, t).

Based on the calculated probability P(n, k, t), the CPU 11 estimates the number t of keywords (t is a natural number, t < k) to be acquired from each client 2 as its highest-frequency keywords (step S503). Specifically, it estimates a value of t for which the calculated probability P(n, k, t) is equal to or greater than the predetermined threshold probability p.

The CPU 11 acquires the top t keywords in frequency order from each client 2 (step S504). The CPU 11 then selects the overall top k keywords (k is a natural number) in frequency order from the n × t keywords acquired from the n clients 2 (step S505).

The CPU 11 determines whether the selected top k keywords are all selected from the top (t - 1) keywords of each client 2 and whether, for each client 2, the lowest frequency among its selected keywords differs from the frequency of that client's t-th keyword (step S506). If the CPU 11 determines that both conditions hold (step S506: YES), it displays a message indicating that the selection result is accurate (step S507). If the CPU 11 determines that, for some client 2, the selection extends to its top t keywords, or that the lowest frequency among a client's selected keywords equals the frequency of that client's t-th keyword (step S506: NO), it displays a message indicating that the selection result is accurate for those keywords whose frequency is greater than that of the t-th keyword (step S508).
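The two-part check of step S506 can be sketched as below. The data layout and the function name `selection_is_accurate` are hypothetical. Two points the text leaves open are handled by assumption: a client with no selected keyword, or with fewer than t keywords in its list, is treated as passing the frequency comparison.

```python
import heapq


def selection_is_accurate(client_lists, k, t):
    """Return True when the overall top-k selection passes both conditions of the check.

    client_lists[i] holds client i's top-t (keyword, frequency) pairs,
    sorted by descending frequency.
    """
    tagged = [(freq, i) for i, lst in enumerate(client_lists) for _, freq in lst]
    selected = heapq.nlargest(k, tagged)
    for i, lst in enumerate(client_lists):
        picked = [freq for freq, owner in selected if owner == i]
        if len(picked) > t - 1:
            return False  # the selection reached this client's t-th keyword
        if picked and len(lst) >= t and min(picked) == lst[t - 1][1]:
            return False  # tie with the t-th keyword, so the boundary is ambiguous
    return True


clients = [
    [("a", 9), ("b", 8), ("c", 1)],
    [("d", 7), ("e", 2), ("f", 1)],
]
print(selection_is_accurate(clients, k=3, t=3))
```

When the check fails, the text above falls back to reporting accuracy only for keywords whose frequency exceeds that of the t-th keyword; that branch is not shown here.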

In other words, the processing described above exploits the fact that the selected top k keywords are accurate when they are all selected from the top (t - 1) keywords of each client 2 and, for each client 2, the lowest frequency among its selected keywords differs from the frequency of that client's t-th keyword. Accordingly, by acquiring the top t keywords (t is a natural number) in frequency order from each client 2 and checking whether these two conditions hold for the overall top k keywords, it can be determined whether the keywords have been selected accurately.

FIG. 6 is a functional block diagram of the information processing apparatus 1 according to the embodiment of the present invention for the case where the probability is not calculated. In FIG. 6, the keyword assignment unit 601 of the information processing apparatus 1 assigns mutually non-overlapping keywords to n clients 2 (n is a natural number).

The keyword acquisition unit 602 acquires the top t keywords in frequency order from each client 2. The determination unit 603 determines, for the case where the overall top k keywords (k is a natural number, t < k) in frequency order are selected from all the acquired keywords, whether the number of keywords taken from each client 2 as highest-frequency keywords is (t - 1) or fewer.

When the determination unit 603 determines that the number of keywords taken from each client 2 as highest-frequency keywords is (t - 1) or fewer, the keyword selection unit 604 selects the overall top k keywords in frequency order from the top (t - 1) keywords of each client 2 among all the keywords acquired from the clients 2.

The result notification unit 605 displays the selected keywords on the display device 23 in descending order of frequency.

FIG. 7 is a flowchart showing the processing procedure of the CPU 11 of the information processing apparatus 1 according to the embodiment of the present invention for the case where the probability is not calculated. In FIG. 7, the CPU 11 of the information processing apparatus 1 assigns mutually non-overlapping keywords to n clients 2 (n is a natural number) with which it is connected for data communication (step S701).

The CPU 11 acquires the top t keywords in frequency order from each client 2 (step S702). For the case where the overall top k keywords (k is a natural number, t < k) are selected from all the acquired keywords, it then determines whether the number of keywords taken from each client 2 as highest-frequency keywords is (t - 1) or fewer and whether, for each client 2, the lowest frequency among its selected keywords differs from the frequency of that client's t-th keyword (step S703).

If the CPU 11 determines that the number of keywords taken from each client 2 is (t - 1) or fewer and that the lowest frequency among each client's selected keywords differs from the frequency of that client's t-th keyword (step S703: YES), it selects the overall top k keywords in frequency order from the top (t - 1) keywords of each client 2 among all the keywords acquired from the clients 2 (step S704). The CPU 11 then displays the selected keywords on the display device 23 in descending order of frequency (step S705).

If the CPU 11 determines that t keywords have been taken from some client 2, or that the lowest frequency among a client's selected keywords equals the frequency of that client's t-th keyword (step S703: NO), it displays a message indicating that the selection result is accurate for those keywords whose frequency is greater than that of the t-th keyword (step S706).

The frequency-order keyword extraction system described above is applied to extracting keywords in frequency order from document data in so-called text mining. For example, in a frequency-order keyword extraction system configured as in FIG. 1, the n clients 2 are distinguished as clients S1, S2, ..., Sn. A hash value is calculated for each keyword by a well-known method, and each keyword is assigned to one of the clients S1, S2, ..., Sn using as the index i the value obtained by adding 1 to the remainder of dividing the hash value by the number of clients n (n is a natural number).
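The assignment rule (hash value modulo n, plus 1) can be sketched as follows. The text only says that the hash is computed "by a well-known method", so the use of MD5 here is an assumption made to obtain a value that is stable across processes; `client_index` and `partition` are hypothetical names.

```python
import hashlib


def client_index(keyword, n):
    """Return the index i in 1..n of the client that owns the keyword."""
    digest = hashlib.md5(keyword.encode("utf-8")).digest()  # assumed hash function
    return int.from_bytes(digest, "big") % n + 1


def partition(keywords, n):
    """Group keywords by owning client; every keyword lands in exactly one group."""
    buckets = {i: [] for i in range(1, n + 1)}
    for keyword in keywords:
        buckets[client_index(keyword, n)].append(keyword)
    return buckets


buckets = partition(["alpha", "beta", "gamma", "delta"], 3)
print(buckets)
```

Because the index depends only on the keyword, every occurrence of a keyword is counted by the same client, which is what keeps the per-client keyword sets from overlapping.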

To extract the top k keywords in frequency order, the user specifies the selection count k and a predetermined threshold probability p. FIG. 8 shows an example of a screen that accepts the selection count k and the threshold probability p. The example of FIG. 8 has a selection count input area 81, which accepts the selection count k, and a probability input area 82, which accepts the threshold probability p. When the "Search" button 83 is selected after these values have been entered, the top k keywords in frequency order are extracted.

Needless to say, a value of t satisfying P(n, k, t) > p is calculated in advance, using any of probability calculation methods (1) to (3) described above, according to the selection count k and the threshold probability p that were specified.

FIG. 9 is a table showing examples of the calculated value of t for each probability calculation method. For example, when n = 32 and k = 100, the value of t that allows the correct top k keywords to be selected with a probability of 90% or more is t = 16 with probability calculation method (1) and t = 9 with probability calculation methods (2) and (3). The value of t that allows the correct top k keywords to be selected with a probability of 95% or more is t = 18 with probability calculation method (1) and t = 10 with probability calculation methods (2) and (3). In every case, t is smaller than k = 100.

FIG. 10 is a table showing the execution time (in seconds) of the process of selecting the top k keywords in frequency order when n = 32 and p = 90%. FIG. 10 gives the execution times for k = 100, 1000, and 10000. The entry labeled "Conventional" shows the execution time of the conventional method, and the entry labeled "1" shows the execution time with probability calculation method (1). As FIG. 10 shows, using the frequency-order keyword extraction system according to the present embodiment makes the processing 1.5 to 2.5 times faster.

As described above, according to the present embodiment, obtaining the overall top k keywords does not require every one of the clients 2 to obtain its own top k keywords; each client only needs to obtain t keywords, where t is smaller than k. This reduces the processing load on each client 2 and narrows down the number of highest-frequency keywords that each client 2 must supply, so the response time for search results can be shortened considerably.

The present invention is not limited to the embodiment described above, and various modifications and improvements are possible within the scope of the invention. For example, instead of providing the information processing apparatus 1 separately, one of the clients 2 may serve as the apparatus that obtains the highest-frequency keywords.

1 Information processing apparatus
2 Client
11 CPU
12 Memory
13 Storage device
14 I/O interface
15 Video interface
16 Portable disk drive
17 Communication interface
18 Internal bus
90 Portable recording medium
100 Computer program

Claims

A method executable by an apparatus that obtains keywords in frequency order using a plurality of computers, wherein the apparatus performs the steps of:
assigning mutually non-overlapping keywords to n computers (n is a natural number);
calculating a probability P(n, k, t) that, when the highest-frequency keywords are acquired from the keywords assigned to each computer and the overall top k keywords (k is a natural number) in frequency order are selected from all the acquired keywords, the highest-frequency keywords of each computer constitute the top k in frequency order;
estimating, based on the calculated probability P(n, k, t), the number t of keywords (t is a natural number, t < k) to be acquired from each computer as its highest-frequency keywords; and
acquiring the top t keywords in frequency order from each computer and selecting the overall top k keywords in frequency order from all the acquired keywords.

The method according to claim 1, wherein the probability P(n, k, t) is the probability that, when the top k keywords in frequency order are selected from n × t keywords, at most (t - 1) keywords are selected from any computer in the combination of the numbers of keywords selected per computer.

The method according to claim 1, wherein the probability P(n, k, t) is the probability that, when the top k keywords in frequency order are selected from n × t keywords with their frequency order taken into account, at most (t - 1) keywords are selected from any computer in the combination of the numbers of keywords selected per computer.

The method according to claim 1, wherein the probability P(n, k, t) is the probability that, when a combination for selecting the top k keywords in frequency order from the n computers is chosen, at most t keywords are selected from any computer in the chosen combination.

A method executable by an apparatus that obtains keywords in frequency order using a plurality of computers, wherein the apparatus performs the steps of:
assigning mutually non-overlapping keywords to n computers (n is a natural number);
acquiring the top t keywords (t is a natural number) in frequency order from each computer;
determining, for the case where the highest-frequency keywords are acquired from the keywords assigned to each computer and the overall top k keywords (k is a natural number, t < k) in frequency order are selected from all the acquired keywords, whether the number of keywords taken from each computer as highest-frequency keywords is (t - 1) or fewer and whether the lowest frequency among the keywords selected at each client differs from the frequency of that client's t-th keyword; and
selecting, when it is determined that the number of keywords taken from each computer is (t - 1) or fewer and that the lowest frequency among the keywords selected at each client differs from the frequency of that client's t-th keyword, the overall top k keywords in frequency order from the top (t - 1) keywords of each computer among all the keywords acquired from the computers.

An apparatus that obtains keywords in frequency order using a plurality of computers, the apparatus comprising:
a keyword assignment unit that assigns mutually non-overlapping keywords to n computers (n is a natural number);
a probability calculation unit that calculates a probability P(n, k, t) that, when the highest-frequency keywords are acquired from the keywords assigned to each computer and the overall top k keywords (k is a natural number) in frequency order are selected from all the acquired keywords, the highest-frequency keywords of each computer constitute the top k in frequency order; and
an estimation unit that estimates, based on the calculated probability P(n, k, t), the number t of keywords (t is a natural number, t < k) to be acquired from each computer as its highest-frequency keywords,
wherein the apparatus acquires the top t keywords in frequency order from each computer and selects the overall top k keywords in frequency order from all the acquired keywords.

The apparatus according to claim 6, wherein the probability P(n, k, t) is the probability that, when the top k keywords in frequency order are selected from n × t keywords, at most (t - 1) keywords are selected from any computer in the combination of the numbers of keywords selected per computer.

The apparatus according to claim 6, wherein the probability P(n, k, t) is the probability that, when the top k keywords in frequency order are selected from n × t keywords with their frequency order taken into account, at most (t - 1) keywords are selected from any computer in the combination of the numbers of keywords selected per computer.

The apparatus according to claim 6, wherein the probability P(n, k, t) is the probability that, when a combination for selecting the top k keywords in frequency order from the n computers is chosen, at most t keywords are selected from any computer in the chosen combination.

An apparatus that obtains keywords in frequency order using a plurality of computers, the apparatus comprising:
a keyword assignment unit that assigns mutually non-overlapping keywords to n computers (n is a natural number);
a keyword acquisition unit that acquires the top t keywords (t is a natural number) in frequency order from each computer; and
a determination unit that determines, for the case where the highest-frequency keywords are acquired from the keywords assigned to each computer and the overall top k keywords (k is a natural number, t < k) in frequency order are selected from all the acquired keywords, whether the number of keywords taken from each computer as highest-frequency keywords is (t - 1) or fewer and whether the lowest frequency among the keywords selected at each client differs from the frequency of that client's t-th keyword,
wherein, when it is determined that the number of keywords taken from each computer is (t - 1) or fewer and that the lowest frequency among the keywords selected at each client differs from the frequency of that client's t-th keyword, the apparatus selects the overall top k keywords in frequency order from the top (t - 1) keywords of each computer among all the keywords acquired from the computers.

A computer program executable by an apparatus that obtains keywords in frequency order using a plurality of computers, the program causing the apparatus to function as:
keyword assignment means for assigning mutually non-overlapping keywords to n computers (n is a natural number);
probability calculation means for calculating a probability P(n, k, t) that, when the highest-frequency keywords are acquired from the keywords assigned to each computer and the overall top k keywords (k is a natural number) in frequency order are selected from all the acquired keywords, the highest-frequency keywords of each computer constitute the top k in frequency order;
estimation means for estimating, based on the calculated probability P(n, k, t), the number t of keywords (t is a natural number, t < k) to be acquired from each computer as its highest-frequency keywords; and
means for acquiring the top t keywords in frequency order from each computer and selecting the overall top k keywords in frequency order from all the acquired keywords.

The computer program according to claim 11, wherein the probability P(n, k, t) is the probability that, among the combinations of the number of keywords selected from each computer when the top k keywords in order of frequency are selected from the n × t keywords, (t−1) keywords are selected from any computer.

The computer program according to claim 11, wherein the probability P(n, k, t) is the probability that, among the combinations, taking the frequency order into account, of the number of keywords selected from each computer when the top k keywords in order of frequency are selected from the n × t keywords, at most (t−1) keywords are selected from any computer.

The computer program according to claim 11, wherein the probability P(n, k, t) is the probability that, when a combination in which the top k keywords in order of frequency are selected from the n computers is chosen, at most t keywords are selected from any computer in the chosen combination.
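
For illustration only, one possible reading of the probability in the preceding claims is a balls-in-bins calculation. The sketch below assumes, which the claims do not state, that each of the overall top k keywords is equally likely to have been assigned to any of the n computers, independently of the others, and it computes the probability that no computer holds more than a given number of them. The names `prob_max_load_at_most` and `estimate_t` and the `threshold` parameter are hypothetical.

```python
from fractions import Fraction
from math import factorial


def prob_max_load_at_most(n, k, cap):
    """Probability that no computer holds more than `cap` of the k keywords.

    Assumed model (not taken from the claims): each keyword lands on one
    of n computers uniformly and independently.
    """
    # coeffs[j] is the coefficient of x^j in (sum_{i=0..cap} x^i / i!)^m,
    # where m is the number of computers processed so far.
    coeffs = [Fraction(1)] + [Fraction(0)] * k
    for _ in range(n):
        nxt = [Fraction(0)] * (k + 1)
        for j, c in enumerate(coeffs):
            if c == 0:
                continue
            for i in range(min(cap, k - j) + 1):
                nxt[j + i] += c / factorial(i)
        coeffs = nxt
    # k! * [x^k] counts the assignments within the cap; n^k counts all of them.
    return coeffs[k] * factorial(k) / Fraction(n) ** k


def estimate_t(n, k, threshold):
    """Smallest t < k with P(at most t-1 per computer) >= threshold, else k."""
    for t in range(1, k):
        if prob_max_load_at_most(n, k, t - 1) >= threshold:
            return t
    return k  # acquiring k keywords from every computer is always sufficient


if __name__ == "__main__":
    print(prob_max_load_at_most(2, 2, 1))      # two keywords on two computers
    print(estimate_t(4, 10, Fraction(9, 10)))  # hypothetical n, k and threshold
```

Passing `cap = t - 1` corresponds to the "at most (t−1) keywords" wording and `cap = t` to the "at most t keywords" wording; exact fractions are used so that the comparison with the threshold is not affected by rounding.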

A computer program executable by an apparatus that obtains keywords in order of frequency using a plurality of computers, the computer program causing the apparatus to function as:
keyword assigning means for assigning non-overlapping keywords to n (n is a natural number) computers;
keyword acquisition means for acquiring the top t keywords (t is a natural number) in order of frequency from each computer;
determination means for determining, when top keywords in order of frequency are acquired from the keywords assigned to each computer and the overall top k keywords (k is a natural number, t < k) in order of frequency are selected from all the acquired keywords, whether the number of keywords taken from each computer as top keywords in order of frequency is (t−1) or less, and whether the minimum frequency among the keywords selected from each computer differs from the frequency of the t-th keyword of that computer; and
means for selecting, when it is determined that the number of keywords taken from each computer as top keywords is (t−1) or less and that the minimum frequency among the keywords selected from each computer differs from the frequency of the t-th keyword of that computer, the overall top k keywords in order of frequency, from among all the acquired keywords, out of the top (t−1) keywords in order of frequency of each computer.
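
For illustration only, the determination recited above can be sketched as a check performed after the top t keywords have been acquired from each computer. The function name `select_and_check`, the input layout (one list of (keyword, frequency) pairs per computer, sorted by descending frequency) and the handling of computers that contribute no keyword or hold fewer than t keywords are assumptions made for this sketch.

```python
import heapq


def select_and_check(per_computer_top_t, k, t):
    """Select the overall top k and report whether the two conditions hold.

    per_computer_top_t: one list per computer of (keyword, frequency)
    pairs, sorted by descending frequency, with at most t entries.
    Returns (selected, ok). ok is True when every computer contributes
    at most t-1 keywords and its lowest selected frequency differs from
    the frequency of its t-th keyword.
    """
    tagged = [
        (freq, idx, kw)
        for idx, top in enumerate(per_computer_top_t)
        for kw, freq in top
    ]
    selected = heapq.nlargest(k, tagged, key=lambda item: item[0])

    ok = True
    for idx, top in enumerate(per_computer_top_t):
        mine = [freq for freq, i, _ in selected if i == idx]
        if len(mine) > t - 1:
            ok = False  # this computer may hold further high-frequency keywords
        elif mine and len(top) >= t and min(mine) == top[t - 1][1]:
            ok = False  # tie with the t-th keyword leaves the ranking ambiguous
    return [(kw, freq) for freq, _, kw in selected], ok


if __name__ == "__main__":
    lists = [
        [("a", 9), ("b", 7), ("c", 1)],
        [("d", 8), ("e", 2), ("f", 1)],
    ]
    print(select_and_check(lists, k=3, t=3))
```

When `ok` is False, a caller would presumably need to acquire more keywords from the computers concerned before trusting the ranking; that recovery step is outside this sketch.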