JP6325132B2

JP6325132B2 - Data collection device and data collection method

Info

Publication number: JP6325132B2
Application number: JP2016562135A
Authority: JP
Inventors: 裕早矢仕; 石黒　正雄; 正雄石黒; 直史冨田; 和重廣井
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2014-12-03
Filing date: 2014-12-03
Publication date: 2018-05-16
Anticipated expiration: 2034-12-03
Also published as: JPWO2016088212A1; WO2016088212A1

Description

The present invention relates to a data collection device and a data collection method.

Patent Document 1 states: "When a user selects an article for which related articles are to be searched, a search expression is automatically generated while being dynamically changed so that the number of hits comes closest to a reference search count, and the related articles are displayed."

Patent Document 2 describes a system that "has keyword input accepting means for accepting a keyword input from a user; search result presenting means for obtaining the number of matches, within the target documents, of a search expression determined by the accepted keyword, and presenting the search expression and that number to the user; related word generating means for generating related words for the search expression presented by the search result presenting means; and search result prediction presenting means for obtaining the number of matches, within the target documents, of a search expression containing the presented keyword and a generated related word, and presenting that number and the generated related word to the user."

Patent Document 1: JP 2005-100136 A
Patent Document 2: JP-A-5-314182

In Patent Document 1, the search expression is changed dynamically so that the number of hits comes closest to the reference search count. As a result, the search results may include information different from what the user expected, or may omit information the user expected. In Patent Document 2, the user selects related words and specifies new keywords based on the presented information, so the same problem as in Patent Document 1 can arise.

An object of the present invention is to provide a data collection device and a data collection method that can accurately collect the information a user is looking for when data is collected by specifying search conditions.

One aspect of the present invention for achieving the above object is an information collection device that collects information satisfying a search condition from a data group. The device searches the data group using each of a plurality of search conditions, and identifies candidates for the search condition to be used for the search based on the result of classifying the plurality of search conditions into a plurality of clusters according to the number of hits of each search condition.

Other problems disclosed in the present application, and the means of solving them, will become apparent from the description of the embodiments and from the drawings.

According to the present invention, the information a user is looking for can be collected accurately when data is collected by specifying search conditions.

FIG. 1 is a diagram showing a schematic configuration of a data collection system 1.
FIG. 2 is a diagram explaining the concept of the processing of the data collection device 10.
FIG. 3 is an example of the hardware of an information processing apparatus 100 used to realize the data collection device 10 and the server device 20.
FIG. 4 is a data flow diagram explaining the functions of the data collection device 10 and the data it manages.
FIG. 5 is an example of collected data 402.
FIG. 6 is an example of a set search word 403.
FIG. 7 is an example of a synonym dictionary 405.
FIG. 8 is an example of an antonym dictionary 406.
FIG. 9 is an example of hit count data 408.
FIG. 10 is an example of search condition data 411.
FIG. 11 is a flowchart explaining search condition generation process S1100.
FIG. 12 is a flowchart explaining collected data acquisition process S1200.
FIG. 13 is a flowchart explaining search process S1300.
FIG. 14 is a flowchart explaining cluster generation process S1400.
FIG. 15 is a flowchart explaining exclusion condition determination process S1500.
FIG. 16 is an example of an exclusion condition setting screen 1600.
FIG. 17 is an example of hit count data 408A in the second embodiment.
FIG. 18 is a diagram showing an example of the similarity determination performed by the cluster generation unit 409 of the third embodiment.

An embodiment of the present invention is described below with reference to the drawings. In the following description, components having the same function and configuration are given the same reference numerals, and redundant description may be omitted.

= First embodiment =
FIG. 1 is a diagram showing a schematic configuration of a data collection system 1 described as one embodiment. As shown in the figure, the data collection system 1 includes a data collection device 10 and a server device 20. The data collection device 10 and the server device 20 are communicably connected via a communication network 5. The communication network 5 is, for example, the Internet or a dedicated line.

The server device 20 functions as a device that provides information (data) to other devices that access it via the communication network 5 (for example, a Web server, an SNS (Social Network Service) server, or an open data server).

The data collection device 10 accesses the server device 20 via the communication network 5 and acquires data from it. The data acquired by the data collection device 10 is used, for example, for trend analysis or causal relationship analysis on a specific topic.

The data collection device 10 collects information that satisfies a search condition (search expression), that is, information that hits, from the data group acquired from the server device 20. In doing so, the data collection device 10 searches the data group using a plurality of search conditions, and identifies candidates for the search condition to be used for searching the data group based on the result of classifying the plurality of search conditions into a plurality of clusters according to the number of hits of each search condition.

As shown in FIG. 2, when identifying the candidates, the data collection device 10 excludes from the search condition candidates any search condition that belongs to a cluster whose average value falls outside a preset range. For example, the data collection device 10 excludes the search conditions (h, i, j) belonging to the cluster whose average value exceeds the preset range (cluster C in the figure).

In this way, the data collection device 10 classifies the search conditions into a plurality of clusters and identifies search condition candidates cluster by cluster. It can therefore effectively exclude search conditions belonging to a cluster with a very large number of hits, and can identify search conditions that yield search results free of information the user does not want (hereinafter also referred to as noise). A search condition belonging to a cluster with a very large number of hits typically contains, for example, an ambiguous keyword, so its results include a great deal of information the user does not want; such a condition is therefore excluded from the candidates.
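The cluster-level exclusion described above can be sketched as follows. This is only an illustration under stated assumptions, not the patent's implementation: the function name `exclude_by_cluster_mean`, the hit counts, the membership of clusters A and B, and the range bounds are all hypothetical, and the sketch assumes the average is taken over the hit counts of the conditions in each cluster.

```python
from statistics import mean

def exclude_by_cluster_mean(clusters, low, high):
    """Split search conditions into kept and excluded lists, cluster by cluster.

    clusters maps a cluster name to a list of (condition, hit_count) pairs.
    A whole cluster is excluded when its mean hit count lies outside [low, high].
    """
    kept, excluded = [], []
    for members in clusters.values():
        avg = mean(hits for _, hits in members)
        target = kept if low <= avg <= high else excluded
        target.extend(cond for cond, _ in members)
    return kept, excluded

# Hypothetical hit counts; only "cluster C holds h, i, j" comes from the text.
clusters = {
    "A": [("a", 10), ("b", 12), ("c", 11)],
    "B": [("d", 40), ("e", 45), ("f", 42), ("g", 44)],
    "C": [("h", 900), ("i", 950), ("j", 1000)],
}
kept, excluded = exclude_by_cluster_mean(clusters, low=5, high=100)
```

Because the decision is made per cluster, a single condition is never excluded on its own hit count alone; it shares the fate of the cluster it was assigned to.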

As shown in FIG. 2, the data collection device 10 generates the plurality of search conditions by, for example, looking up related words for a word contained in a given search condition in a related word dictionary (synonym dictionary, antonym dictionary, specialization dictionary, and so on) and substituting them for that word. The search conditions from which the candidates are selected can therefore be generated efficiently without burdening the user.

The data collection device 10 classifies the search conditions into the plurality of clusters using, for example, the k-means method. The data collection device 10 also has a user interface that presents the search conditions identified as candidates and lets the user choose among them, and it accepts from the user the specification of the search condition to be used for the search. The user can thus make the final decision on a search condition suited to efficiently obtaining the desired information.
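As an illustration of clustering search conditions by hit count, the following is a minimal one-dimensional k-means (Lloyd's algorithm) in plain Python. The text names only the k-means method; the function name `kmeans_1d`, the deterministic centroid initialisation, the iteration cap, and the sample hit counts are assumptions made for this sketch.

```python
def kmeans_1d(values, k, iterations=100):
    """Cluster scalar values with Lloyd's algorithm; return one label per value."""
    ordered = sorted(values)
    # Assumed initialisation: pick k centroids spread evenly over the sorted values.
    centroids = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    labels = []
    for _ in range(iterations):
        # Assignment step: each value goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: abs(v - centroids[c])) for v in values]
        # Update step: each centroid becomes the mean of its members.
        updated = []
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            updated.append(sum(members) / len(members) if members else centroids[c])
        if updated == centroids:
            break  # converged
        centroids = updated
    return labels

# Hypothetical hit counts for ten search conditions.
hit_counts = [10, 12, 11, 40, 45, 42, 44, 900, 950, 1000]
labels = kmeans_1d(hit_counts, k=3)
```

The number of clusters `k` has to be chosen in advance; the text does not say how it is set, so it is left as a parameter here.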

FIG. 3 is an example of the hardware of an information processing apparatus 100 used to realize the data collection device 10 and the server device 20. As shown in the figure, the information processing apparatus 100 includes a processor 101, a storage device 102, an input device 104, an output device 105, and a communication device 106. These are communicably connected via a communication means such as a bus.

The processor 101 is configured using, for example, a CPU (Central Processing Unit) or an MPU (Micro Processing Unit). The storage device 102 stores programs and data, and is, for example, a ROM (Read Only Memory), a RAM (Random Access Memory), an NVRAM (Non Volatile RAM), a hard disk drive, an SSD (Solid State Drive), or an optical storage device. The input device 104 is a user interface that accepts information and instructions from the user, such as a keyboard, a mouse, or a touch panel. The output device 105 is a user interface that provides information to the user, such as a graphics card or a liquid crystal monitor. The communication device 106 is a communication interface that communicates with other devices via the communication network 5, such as a NIC (Network Interface Card) or a wireless LAN interface.

FIG. 4 is a data flow diagram explaining the functions of the data collection device 10 and the data it manages.

As shown in the figure, the data collection device 10 has the following functions: a data acquisition unit 401, a search condition generation unit 404, a data search unit 407, a cluster generation unit 409, an exclusion condition determination unit 410, and an exclusion condition selection unit 412. These functions are realized by the processor 101 reading and executing a program stored in the storage device 102. They may instead be realized by hardware (for example, an ASIC (Application Specific Integrated Circuit)). Several of these functions may be realized by a single piece of hardware, or the functions may be realized by multiple pieces of hardware in a distributed or cooperative manner. The functions shown in the figure are defined only for convenience, to make the functions of the data collection device 10 easier to understand; how the functions are divided and named is not limited to what is shown here.

As shown in the figure, the data collection device 10 manages collected data 402, set search words 403, a synonym dictionary 405, an antonym dictionary 406, hit count data 408, search condition data 411, and so on. The data collection device 10 manages this data as, for example, tables of a database provided by a DBMS (DataBase Management System). In the following, the synonym dictionary 405 and the antonym dictionary 406 are used as examples of the related word dictionaries used to generate the plurality of search conditions, but other kinds of dictionary may also be used (for example, a specialization dictionary listing words that are subordinate concepts of a given word).

Among the functions shown in the figure, the data acquisition unit 401 acquires data (SNS posts, article data such as news articles, open data, and so on) from the server device 20 via the communication network 5. Specifically, the data acquisition unit 401 acquires data from the server device 20 periodically (for example, once a week) by means such as an API (Application Programming Interface) or crawling. The data acquisition unit 401 manages the acquired data group as collected data 402.

FIG. 5 shows an example of the collected data 402. Each row (one record) shown in the figure corresponds to one item of collected data 402. As shown in the figure, the collected data 402 includes the items provision date and time 4021, source 4022, provider ID 4023, and body 4024. The provision date and time 4021 stores the date and time at which the collected data 402 was provided (for example, when it was posted to the SNS). The source 4022 stores information indicating the type of source from which the collected data 402 was acquired: for example, "SNS" if the source is an SNS server, or "news" if the source is a Web server that provides news articles. The provider ID 4023 stores information (user name, account name, and so on) identifying the provider (poster, contributor, and so on) of the collected data 402. The body 4024 stores the information corresponding to the body text of the collected data 402.

Returning to FIG. 4, the search condition generation unit 404 acquires, for example, the word group of a search condition registered in advance by the user from the set search words 403. It then acquires the synonyms of each word of the search condition from the synonym dictionary 405 and the antonyms of each word from the antonym dictionary 406, and generates a plurality of search conditions by combining the words of the user-set search condition with the acquired synonyms and antonyms.

FIG. 6 shows an example of the set search words 403. Each row (one record) shown in the figure corresponds to one set search word 403. The set search word 403 stores the words, set in advance by the user, that are used as a search condition. As shown in the figure, the set search word 403 includes the items search condition ID 4031 and search condition (search word 1 (4032), search word 2 (4033), search word 3 (4034), ...). The search condition ID 4031 stores identification information that identifies the search condition. The search condition (search word 1 (4032), search word 2 (4033), search word 3 (4034), ...) stores the words, set in advance by the user, that are used as the search condition.

FIG. 7 shows an example of the synonym dictionary 405 (thesaurus). Each row (one record) shown in the figure corresponds to one entry of the synonym dictionary 405. An entry includes one word (target word 4051) and one or more synonyms of that word (synonyms 4052, 4053, 4054, ...). For example, the entry in the first row of the figure includes the target word 4051 "将来" (future) together with the synonym 4052 "今後" (from now on) and the synonym 4053 "未来" (the future).

FIG. 8 shows an example of the antonym dictionary 406. Each row (one record) shown in the figure corresponds to one entry of the antonym dictionary 406. An entry includes one word (target word 4061) and one or more antonyms of that word (antonyms 4062, 4063, 4064, ...). For example, the entry in the second row of the figure includes the target word 4061 "安心" (peace of mind) together with the antonym 4062 "不安" (anxiety) and the antonym 4063 "心配" (worry).

Returning to FIG. 4, the data search unit 407 searches the collected data 402 using the plurality of search conditions generated by the search condition generation unit 404, and manages the number of hits of each search condition as hit count data 408.

FIG. 9 shows an example of the hit count data 408. Each row (one record) shown in the figure corresponds to one item of hit count data 408. The hit count data 408 includes a search condition (search word candidate 1 (4081), search word candidate 2 (4082), search word candidate 3 (4083), ...) tallied by the data search unit 407, and the hit count 4084 for that search condition. The search condition fields store the word group of the search condition in the set search words 403, as well as combinations in which words of that word group are replaced by their synonyms or antonyms. For example, the figure stores (電話, 休止) (telephone, suspension), which is the word group of the search condition contained in the set search words 403; (電話, 中止) (telephone, cancellation), (電話, 停止) (telephone, stop), and (電話, ポーズ) (telephone, pause), in which "休止" is replaced by a synonym; and (テレホン, 休止) (telephone as a loanword, suspension), in which "電話" is replaced by a synonym.
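A tally like the hit count data 408 could be produced along the following lines. This is a sketch under assumptions that the text does not state: a record is taken to hit a condition when its body contains every word of the condition, and bodies are split on whitespace (which suits the English sample data here, not unsegmented Japanese text). The name `count_hits` and the sample bodies are hypothetical.

```python
def count_hits(bodies, conditions):
    """Return {condition: number of bodies containing all words of the condition}."""
    tokenized = [set(body.split()) for body in bodies]
    return {
        cond: sum(set(cond) <= tokens for tokens in tokenized)
        for cond in conditions
    }

# Hypothetical body texts standing in for the body 4024 of collected data 402.
bodies = [
    "phone service pause announced",
    "phone line stop",
    "telephone pause",
    "phone pause again",
]
conditions = [("phone", "pause"), ("phone", "stop"), ("telephone", "pause")]
hits = count_hits(bodies, conditions)
```

Each key of the result corresponds to one row of search word candidates, and each value to the hit count 4084 of that row.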

Returning to FIG. 4, the cluster generation unit 409 classifies the plurality of search conditions generated by the search condition generation unit 404 into clusters according to the hit count 4084 of the hit count data 408 for each condition.

The exclusion condition determination unit 410 identifies the search condition candidates (that is, identifies the search conditions to be excluded from the candidates) based on the result of classifying the search conditions into clusters.

The exclusion condition selection unit 412 presents the contents of the search condition data 411 (the search conditions kept as candidates and the search conditions excluded from the candidates) to the user, and accepts from the user the specification of the search condition to be used for the search.

FIG. 10 shows an example of the search condition data 411. Each row (one record) shown in the figure corresponds to one item of search condition data 411. As shown in the figure, the search condition data 411 includes a search condition ID 4111, a search condition (search word candidate 1 (4112), search word candidate 2 (4113), search word candidate 3 (4114), ...), an exclusion target determination result 4115, and an exclusion selection result 4116. The search condition ID 4111 stores identification information that identifies the search condition. The search condition fields store the search condition (search word candidate 1 (4081), search word candidate 2 (4082), search word candidate 3 (4083), ...) from the hit count data 408. The exclusion target determination result 4115 stores the result of the exclusion condition determination unit 410 determining whether the search condition is an exclusion target: for example, "1" if the search condition was determined to be an exclusion target, and "0" otherwise.
The exclusion selection result 4116 stores the user's choice, made through the user interface of the exclusion condition selection unit 412, of whether to exclude a search condition that was determined to be an exclusion target. For example, "-" is stored if the search condition is not an exclusion target, "1" is stored if the search condition is an exclusion target and the user chose to exclude it, and "0" is stored if the search condition is an exclusion target and the user did not choose to exclude it.
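The encoding of the exclusion selection result 4116 described above can be written as a small function. The values "-", "1", and "0" come from the text; the function and parameter names are hypothetical.

```python
def exclusion_selection(is_exclusion_target: bool, user_excludes: bool) -> str:
    """Encode the exclusion selection result for one search condition."""
    if not is_exclusion_target:
        return "-"  # not an exclusion target, so the user is never asked
    return "1" if user_excludes else "0"

# One value per (determination result, user choice) combination.
examples = [
    exclusion_selection(False, False),
    exclusion_selection(True, True),
    exclusion_selection(True, False),
]
```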

Next, the processing performed by the data collection device 10 is described.

<Search condition generation process>
FIG. 11 is a flowchart explaining the process in which the search condition generation unit 404 of the data collection device 10 generates search conditions for the collected data 402 based on the set search words 403, the synonym dictionary 405, and the antonym dictionary 406 (hereinafter referred to as search condition generation process S1100). The data collection device 10 starts the search condition generation process S1100 when, for example, an operation to start the process is performed on the input device 104. The search condition generation process S1100 is described below with reference to the figure. The description takes as an example the case where there are two search words 501 (the case where search word 4032 and search word 4033 are set in FIG. 6).

As shown in the figure, the search condition generation unit 404 first acquires the first search word 1 (4032) from the set search words 403 (S1101). In the following, the list of related words for the first search word 1 (4032) is written as related word list W1, and the n-th related word in the related word list W1 is written as W1[n].

Next, the search condition generation unit 404 acquires the synonyms of the first search word 1 (4032) from the synonym dictionary 405 (the total number of acquired synonyms is written as W1SN below), and registers them in the search word 1 related word list as W1[1] to W1[W1SN] (S1102). The first search word 1 (4032) itself is registered in W1[0].

Next, the search condition generation unit 404 acquires the antonyms of the first search word 1 (4032) from the antonym dictionary 406 (the total number of acquired antonyms is written as W1AN below), and registers them in the search word 1 related word list as W1[W1SN+1] to W1[W1SN+W1AN] (S1103).
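The layout of a related word list built in S1101 to S1103 (the search word at index 0, then the synonyms, then the antonyms) can be sketched as below. The dictionaries are modelled as plain Python dicts holding the example entries from FIG. 7 and FIG. 8; the function name `build_related_list` and this representation are assumptions for illustration.

```python
def build_related_list(word, synonym_dict, antonym_dict):
    """Return (W, SN, AN): W[0] is the word, W[1..SN] synonyms, W[SN+1..SN+AN] antonyms."""
    synonyms = synonym_dict.get(word, [])
    antonyms = antonym_dict.get(word, [])
    return [word, *synonyms, *antonyms], len(synonyms), len(antonyms)

# Example entries taken from the descriptions of FIG. 7 and FIG. 8.
synonym_dict = {"将来": ["今後", "未来"]}
antonym_dict = {"安心": ["不安", "心配"]}

w1, w1sn, w1an = build_related_list("安心", synonym_dict, antonym_dict)
w2, w2sn, w2an = build_related_list("将来", synonym_dict, antonym_dict)
```

With this layout the last valid index of a list is always SN + AN, which is the bound used by the loops in S1110 and S1112.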

Next, the search condition generation unit 404 acquires the second search word 2 (4033) from the set search words 403 (S1104). In the following, the list of related words for the second search word 2 (4033) is written as related word list W2, and the n-th related word in the related word list W2 is written as W2[n].

Next, the search condition generation unit 404 acquires the synonyms of the second search word 2 (4033) from the synonym dictionary 405 (the total number of acquired synonyms is written as W2SN below), and registers them in the search word 2 related word list as W2[1] to W2[W2SN] (S1105). The second search word 2 (4033) itself is registered in W2[0].

Next, the search condition generation unit 404 acquires the antonyms of the second search word 2 (4033) from the antonym dictionary 406 (the total number of acquired antonyms is written as W2AN below), and registers them in the search word 2 related word list as W2[W2SN+1] to W2[W2SN+W2AN] (S1106).

In S1107, the search condition generation unit 404 assigns 0 to the variable I1, which is used as an index specifying a related word in the related word list W1.

In S1108, the search condition generation unit 404 assigns 0 to the variable I2, which is used as an index specifying a related word in the related word list W2.

Next, the search condition generation unit 404 registers the related word pair (W1[I1], W2[I2]) as a search condition (search word candidate 1 (4081), search word candidate 2 (4082), search word candidate 3 (4083), ...) of the hit count data 408, or as search condition data 411 (S1109).

The search condition generation unit 404 then repeats the process of S1109, incrementing I2 each time (S1111), until I2 reaches W2SN+W2AN (that is, until the condition of S1110 is satisfied (S1110: YES)).

When the condition of S1110 is satisfied (S1110: YES), the search condition generation unit 404 repeats the processes of S1108 to S1111, incrementing I1 each time (S1113), until I1 reaches W1SN+W1AN (that is, until the condition of S1112 is satisfied (S1112: YES)).

The processes of S1107 to S1113 amount to selecting one related word from the related word list of each search word (search word 1, search word 2) and registering the resulting set of related words as search condition data 411, for every combination of related words. For example, if the related word list of the first search word 1 (4032) is (安心, 安堵) (peace of mind, relief) and the related word list of the second search word 2 (4033) is (将来, 今後) (future, from now on), the four search conditions (安心, 将来), (安堵, 将来), (安心, 今後), and (安堵, 今後) are registered.

ところで、以上では、検索ワード５０１が２つである場合を例として説明したが、検索ワード５０１の数はいくつであってもよい。尚、検索ワード５０１の数がｎである場合、例えば、各検索ワードについて関連語リストＷ１〜Ｗｎを作成し、関連語の組（Ｗ１［Ｉ１］，Ｗ２［Ｉ２］，…，Ｗｎ［Ｉｎ］）を検索条件に登録することになる。 By the way, although the case where there are two search words 501 has been described above as an example, the number of search words 501 may be any number. When the number of search words 501 is n, for example, related word lists W1 to Wn are created for each search word, and a set of related words (W1 [I1], W2 [I2],..., Wn [In] ) Is registered as a search condition.
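The nested loops of S1107 to S1113, and their generalization to n search words, amount to taking a Cartesian product of the related word lists. The following is a minimal sketch of that idea, not the implementation of the embodiment; the function name `generate_search_conditions` is hypothetical, and the word lists are the example values given in the description.

```python
from itertools import product


def generate_search_conditions(related_word_lists):
    """Return every tuple that takes exactly one word from each list.

    `related_word_lists` corresponds to [W1, W2, ..., Wn]; the function
    name and signature are assumptions made for this illustration.
    """
    return [tuple(combo) for combo in product(*related_word_lists)]


# Example related word lists from the description (two search words).
w1 = ["安心", "安堵"]
w2 = ["将来", "今後"]
conditions = generate_search_conditions([w1, w2])
for condition in conditions:
    print(condition)
```

Because `product` accepts any number of lists, the same call covers the case of n search words without additional loop nesting.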

＜Collected data acquisition process＞
FIG. 12 is a flowchart illustrating processing in which the data acquisition unit 401 of the data collection device 10 acquires data (collected data 402) from the server device 20 via the communication network 5 (hereinafter referred to as collected data acquisition process S1200). The data collection device 10 executes the collected data acquisition process S1200, for example, after the search condition generation process S1100 ends. The collected data acquisition process S1200 is described below with reference to FIG. 12.

As shown in FIG. 12, the data acquisition unit 401 acquires data from the server device 20 through the communication network 5 and stores it as the collected data 402 (S1101). For example, the data acquisition unit 401 acquires data provided by a server device 20 designated in advance by the user. As another example, the data acquisition unit 401 acquires, from the data held by the server device 20, the data for a period designated by the user.

＜Search process＞
FIG. 13 is a flowchart illustrating processing in which the data search unit 407 of the data collection device 10 searches the collected data 402 (hereinafter referred to as search process S1300). The data collection device 10 executes the search process S1300, for example, after the collected data acquisition process S1200 ends. The search process S1300 is described below with reference to FIG. 13.

As shown in FIG. 13, the data search unit 407 first acquires, from the search condition data 411 generated by the search condition generation unit 404, a list of search conditions and the total number of search conditions in that list (S1301). Hereinafter, the list of search conditions is denoted S, the I-th search condition in the list is denoted S[I], and the total number is denoted SCN.

Subsequently, the data search unit 407 substitutes 0 for a variable I (S1302). The variable I is an index for designating a search condition.

Subsequently, the data search unit 407 acquires, from the collected data 402, the items that hit the search condition S[I] (S1303).

Subsequently, the data search unit 407 obtains the number of items of collected data 402 acquired in S1303 (the number of hits), sets the obtained value as the hit count 4084, and generates the hit count data 408 for the currently selected search condition S[I] (S1304).

While incrementing I (S1306), the data search unit 407 repeats the processing of S1303 to S1304 until I = SCN, that is, until all the search conditions in the search condition list S have been processed (S1305: YES).

By executing the search process S1300 described above, the number of items of collected data 402 that hit each of the plurality of search conditions generated by the search condition generation unit 404 (the number of hits) can be obtained.
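The loop of the search process S1300 can be sketched as follows. This is an illustration under stated assumptions, not the embodiment's implementation: an item of collected data is modeled as a plain string, a "hit" is assumed to mean that the item contains every word of the search condition as a substring, and the name `count_hits` and the sample data are hypothetical.

```python
def count_hits(collected_data, search_conditions):
    """Return (search condition, hit count) pairs, one per search condition.

    Assumption: an item hits a condition when it contains all of the
    condition's words as substrings.
    """
    hit_count_data = []
    for condition in search_conditions:                      # index I over the list S
        hits = [item for item in collected_data
                if all(word in item for word in condition)]  # items hitting S[I]
        hit_count_data.append((condition, len(hits)))        # record the hit count
    return hit_count_data


# Hypothetical collected data and search conditions.
collected_data = [
    "battery life is short",
    "battery charger broke",
    "screen is bright",
    "short battery warranty",
]
search_conditions = [("battery", "short"), ("screen", "bright"), ("battery", "screen")]
hit_count_data = count_hits(collected_data, search_conditions)
```

A real system would more likely query a search index than scan strings; the linear scan is used here only to keep the sketch self-contained.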

＜Cluster generation process＞
FIG. 14 is a flowchart illustrating processing in which the cluster generation unit 409 of the data collection device 10 classifies the search conditions registered in the hit count data 408 (search word candidate 1 (4081), search word candidate 2 (4082), search word candidate 3 (4083), ...) into clusters (hereinafter referred to as cluster generation process S1400). The data collection device 10 executes the cluster generation process S1400, for example, after the search process S1300 ends. The following description uses the k-means method as an example of the clustering technique.

As shown in FIG. 14, the cluster generation unit 409 first assigns each hit count 4084 of the hit count data 408 (each record of the hit count data 408) to one of K clusters at random (S1401). The number of clusters K can be set in advance by the user, for example. By adjusting the number of clusters K, the user can look for search conditions better suited to acquiring the target information accurately. Hereinafter, the I-th hit count in the list of hit counts 4084 of the hit count data 408 is denoted v[I], and the total number of hit counts (records) in the list of hit counts 4084 is denoted N.

Subsequently, the cluster generation unit 409 substitutes 0 for a variable k, which is an index that identifies a cluster (S1402).

Subsequently, the cluster generation unit 409 obtains the average of the hit counts v[n] belonging to the cluster identified by the index k, and sets this average as the center value C[k] of that cluster (S1403).

Subsequently, the cluster generation unit 409 repeats the processing of S1403 while incrementing the variable k (S1405) until the variable k reaches K, thereby calculating the center value C[k] (k = 0 to K) of each cluster.

Subsequently, the cluster generation unit 409 substitutes 0 for a variable n, which is an index that identifies one entry in the list of hit counts 4084 (S1406).

Subsequently, the cluster generation unit 409 assigns the hit count v[n] to the cluster k for which the distance between the cluster center value C[k] and v[n] is shortest (S1407). The distance between the cluster center value C[k] and the hit count v[n] is obtained, for example, as the absolute value of C[k] - v[n].

Subsequently, the cluster generation unit 409 repeats the processing of S1407 while incrementing the variable n (S1409) until the variable n reaches N (S1408), thereby reassigning every hit count v[n] (that is, every search condition (search word candidate 1 (4081), search word candidate 2 (4082), search word candidate 3 (4083), ...)) to the cluster whose center value C[k] is nearest.

Next, the cluster generation unit 409 determines whether any cluster assignment changed during the repeated execution of S1407 (S1410). If a cluster assignment changed (S1410: Yes), the cluster generation unit 409 executes the processing of S1402 to S1410 again. If no cluster assignment of a search condition changed during the repeated execution of S1407 (S1410: No), the cluster generation process S1400 ends.

With the above mechanism, the hit counts registered in the hit count data 408 (that is, the search conditions (search word candidate 1 (4081), search word candidate 2 (4082), search word candidate 3 (4083), ...)) can be classified into a plurality of clusters, each composed of search conditions with similar hit counts. Although the k-means method is used above, other clustering techniques may be used.
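The flow of S1401 to S1410 corresponds to one-dimensional k-means over the hit counts. The sketch below follows those steps under a few assumptions that the description does not state: the random initial assignment is seeded for repeatability, a cluster that happens to be empty is skipped when computing centers, and ties in distance keep the current assignment so that the loop terminates. The name `kmeans_1d` and the sample hit counts are hypothetical.

```python
import random


def kmeans_1d(values, num_clusters, seed=0):
    """Cluster scalar hit counts; return (assignment, centers)."""
    rng = random.Random(seed)
    # S1401: assign every hit count to one of K clusters at random.
    assignment = [rng.randrange(num_clusters) for _ in values]
    while True:
        # S1402-S1405: center value C[k] = mean of the members of cluster k.
        centers = []
        for k in range(num_clusters):
            members = [v for v, a in zip(values, assignment) if a == k]
            centers.append(sum(members) / len(members) if members else None)
        # S1406-S1409: reassign each v[n] to the cluster with the nearest center.
        changed = False
        for n, v in enumerate(values):
            nearest = min(
                (k for k in range(num_clusters) if centers[k] is not None),
                # Distance |C[k] - v[n]|; on a tie, prefer the current cluster.
                key=lambda k: (abs(centers[k] - v), k != assignment[n]),
            )
            if nearest != assignment[n]:
                assignment[n] = nearest
                changed = True
        # S1410: stop once no assignment changed during the pass.
        if not changed:
            return assignment, centers


# Hypothetical hit counts: three small values and three large ones.
hit_counts = [3, 5, 4, 1000, 1200, 1100]
assignment, centers = kmeans_1d(hit_counts, num_clusters=2)
```

As with k-means in general, the result depends on the initial assignment, so different seeds may yield different clusters.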

＜Exclusion condition determination process＞
FIG. 15 is a flowchart illustrating processing that the exclusion condition determination unit 410 of the data collection device 10 performs in order to exclude search conditions belonging to clusters with large hit counts (hereinafter referred to as exclusion condition determination process S1500). The data collection device 10 executes the exclusion condition determination process S1500, for example, after the cluster generation process S1400 ends. The exclusion condition determination process S1500 is described below with reference to FIG. 15.

As shown in FIG. 15, the exclusion condition determination unit 410 first substitutes 0 for a variable k, which is an index that identifies a cluster (S1501).

Next, the exclusion condition determination unit 410 determines whether the cluster center value C[k] is larger than a variable p (S1502). Here, the variable p is a threshold for deciding whether to exclude a search condition. For example, the variable p is obtained from the following expression, where Cm is the mean of the center values C[k] of all clusters, Cv is their variance, and α is a parameter predetermined by the user.

[Equation not reproduced in the source text]

Alternatively, the user may inspect the contents of the clusters and specify the variable p directly. By adjusting the variable p, the user can extract search conditions better suited to acquiring the target information.

If the cluster center value C[k] is larger than the variable p (S1502: Yes), the exclusion condition determination unit 410 determines that the search conditions corresponding to the hit counts v[n] in cluster k are exclusion targets, and sets the determination result as the exclusion target determination result 4115 of the search condition data 411 (S1503). While incrementing k (S1505), the exclusion condition determination unit 410 repeats the processing of S1502 to S1503 until k = K (S1504).

The exclusion condition determination process S1500 described above makes it possible to identify, as exclusion targets, the search conditions belonging to clusters with large hit counts. For example, when the hit count data 408 has the contents shown in FIG. 9, search conditions such as (電話 "telephone", ポーズ) and (テレホン "telephone", 休止 "suspension") become exclusion targets. In this example, the word ポーズ is ambiguous (it can mean "pause" or "pose"), and the search results contain a large amount of data (noise) on topics other than the one the user wants to collect, which is why these conditions are excluded. Performing the exclusion condition determination process S1500 in this way efficiently excludes search conditions whose results would contain a large amount of data (noise) that the user does not intend.
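The threshold test of S1501 to S1505 can be sketched as below. Because the expression for the variable p is not reproduced in this text, the sketch takes p as a value supplied by the user, which the description allows as an alternative. The function name, the hit counts behind the cluster centers, and the two non-excluded search conditions are hypothetical; only the two excluded pairs come from the description's example.

```python
def find_excluded_conditions(conditions, assignment, centers, threshold_p):
    """Return the search conditions in clusters whose center value exceeds p."""
    excluded = []
    for k, center in enumerate(centers):      # S1501, S1504, S1505: loop over clusters
        if center > threshold_p:              # S1502: compare C[k] with p
            excluded.extend(                  # S1503: mark the cluster's conditions
                cond for cond, a in zip(conditions, assignment) if a == k
            )
    return excluded


# Hypothetical clustering result: cluster 0 has small hit counts, cluster 1 large ones.
conditions = [("電話", "停止"), ("テレホン", "停止"), ("電話", "ポーズ"), ("テレホン", "休止")]
assignment = [0, 0, 1, 1]
centers = [12.0, 4300.0]
excluded = find_excluded_conditions(conditions, assignment, centers, threshold_p=1000.0)
```

Raising or lowering `threshold_p` changes how many clusters are excluded, which mirrors the user adjusting the variable p.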

FIG. 16 is an example of a screen that the exclusion condition selection unit 412 displays on the display device 205 when having the user make the final decision on which search conditions to exclude (hereinafter referred to as exclusion condition setting screen 1600). As shown in FIG. 16, the exclusion condition setting screen 1600 has a search condition display field 1601, a search condition exclusion designation field 1602, and an exclusion execution instruction button 1603. The search condition display field 1601 displays the words of each search condition that the exclusion condition determination unit 410 has determined to be an exclusion target. The search condition exclusion designation field 1602 displays a control (for example, a check box) with which the user designates whether to exclude the search condition displayed in the search condition display field 1601. The exclusion execution instruction button 1603 is a user interface with which the user instructs the data collection device 10 to execute processing for excluding the search conditions designated in the search condition exclusion designation field 1602. The exclusion designations made via the exclusion condition setting screen 1600 are reflected in the exclusion selection result 4116 of the search condition data 411 in FIG. 10.

Via the exclusion condition setting screen 1600, the user can easily check the search conditions that the exclusion condition determination unit 410 has determined to be exclusion targets (and, conversely, the search condition candidates that the data collection device 10 presents). The user can also decide which search conditions to exclude via the exclusion condition setting screen 1600.

As described above, the data collection device 10 of the present embodiment classifies search conditions into a plurality of clusters and identifies search condition candidates in units of clusters. It can therefore exclude the search conditions belonging to clusters with very large hit counts and identify search condition candidates that yield search results free of information (noise) that the user does not intend.

= Second Embodiment =
The second embodiment has the same basic configuration as the first embodiment, but when totaling the hit counts it weights them according to the source from which the data was acquired. The following description focuses on the parts that differ from the first embodiment.

FIG. 17 is an example of the hit count data 408 in the second embodiment (hereinafter referred to as hit count data 408A). As shown in FIG. 17, the hit count data 408A in the second embodiment has, in addition to the search conditions described above (search word candidate 1 (4081), search word candidate 2 (4082), search word candidate 3 (4083), ...), the items per-source hit count 1 (1701), per-source hit count 2 (1702), ..., and search score 1703.

Of these, per-source hit count 1 (1701), per-source hit count 2 (1702), ... are the numbers of items of collected data 402 that contain the search condition (search word candidate 1 (4081), search word candidate 2 (4082), search word candidate 3 (4083), ...), totaled for each acquisition source of the collected data 402. For example, per-source hit count 1 (1701) in FIG. 17 is the number of items that contain search word candidate 1 (4081) and search word candidate 2 (4082) among the collected data 402 acquired from an SNS server, and per-source hit count 2 (1702) is the number of items that contain search word candidate 1 (4081) and search word candidate 2 (4082) among the collected data 402 acquired from a Web server that provides news articles. The search score 1703 is the sum of per-source hit count 1 (1701), per-source hit count 2 (1702), ..., each weighted according to its acquisition source.

The cluster generation unit 409 performs the cluster generation process S1400 using the search score 1703 as the hit count and classifies the search conditions into clusters. Setting weights according to the acquisition source of the collected data 402, adjusting the hit count of each search condition accordingly, and then classifying the search conditions into clusters makes it possible to obtain search condition candidates that take the nature of each acquisition source into account. For example, assigning a higher weight to an acquisition source whose information is more reliable (credible) makes it possible to collect highly reliable (credible) information.
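The search score 1703 is described as a weighted sum of the per-source hit counts. A minimal sketch of that calculation follows; the source labels, the weight values, the hit counts, and the name `search_score` are all hypothetical, since the description does not give concrete weights.

```python
def search_score(hits_by_source, weights):
    """Return the sum of per-source hit counts, each multiplied by its source weight."""
    return sum(weights[source] * count for source, count in hits_by_source.items())


# Hypothetical per-source hit counts for one search condition.
hits_by_source = {"sns": 120, "news": 30}
# Hypothetical weights: the news source is treated as more reliable.
weights = {"sns": 0.5, "news": 2.0}
score = search_score(hits_by_source, weights)
print(score)
```

The resulting score would then take the place of the raw hit count as the input to clustering.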

= Third Embodiment =
In the third embodiment, for each of a plurality of search conditions, the number of items of collected data 402 that hit the search condition is totaled for each predetermined time range, the similarity between the transitions of these counts is calculated, and the search conditions are classified into a plurality of clusters based on the obtained similarity.

In the third embodiment, the data search unit 407 reads the search conditions stored in the hit count data 408 (search word candidate 1 (4081), search word candidate 2 (4082), search word candidate 3 (4083), ...) and, for each search condition, totals the number of items of collected data 402 that hit the search condition in each predetermined time range. For example, when the predetermined time range is one year, the unit totals, for the designated search condition, the number of hits among the collected data 402 posted in 2011, the number of hits among the collected data 402 posted in 2012, and so on for each year. The hit count data 408 in the third embodiment stores the number of items of collected data 402 that hit the search condition in each predetermined time range.
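The per-period totaling described above can be sketched as follows, using one year as the time range. The sketch assumes that each item of collected data is a (year, text) pair and that a hit means the text contains every word of the search condition; the name `hits_per_year` and the sample data are hypothetical.

```python
from collections import Counter


def hits_per_year(collected_data, condition):
    """Return {year: number of items posted in that year that hit the condition}."""
    counts = Counter()
    for year, text in collected_data:
        if all(word in text for word in condition):
            counts[year] += 1
    return dict(counts)


# Hypothetical collected data as (posting year, text) pairs.
collected_data = [
    (2011, "product A launch in Tokyo"),
    (2011, "product A review"),
    (2012, "product A Tokyo store"),
    (2012, "weather in Tokyo"),
    (2012, "Tokyo product A sale"),
]
yearly_hits = hits_per_year(collected_data, ("product A", "Tokyo"))
```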

The cluster generation unit 409 of the third embodiment measures, for two search conditions, the similarity between the transitions in the number of items of collected data 402 that hit each condition. For example, the cluster generation unit 409 obtains the similarity between the transitions of the hit counts for search condition a and search condition b from the following expression.

[Equation not reproduced in the source text]

In the above expression, a(t) is the number of items of collected data 402 that match search condition a in time range t, b(t) is the number of items of collected data 402 that match search condition b in time range t, and k is a parameter corresponding to a time shift. The parameter k may, for example, be designated in advance by the user or estimated from the hit count data 408.

FIG. 18 shows an example of the similarity determination performed by the cluster generation unit 409 of the third embodiment. In FIG. 18, the frame denoted by reference numeral 1801 contains an example of a pair of search conditions with high similarity, and the frame denoted by reference numeral 1802 contains an example of a pair of search conditions with low similarity. The pair in frame 1801 shows that the similarity is high between search conditions formed by adding the different place names "Tokyo" and "Osaka" to the product name "product A". The pair in frame 1802, in contrast, shows that the similarity is low between search conditions formed by adding the different place names "Tokyo" and "Osaka" to the product name "product B".

In this example, pairs of search conditions whose similarity exceeds a predetermined threshold are placed in the same cluster, and pairs whose similarity is below the threshold are placed in different clusters. Classifying search conditions into clusters according to their similarity in this way takes into account how similarly the hit counts of the search conditions change over time, so that search conditions can be decided with that similarity in mind. As a further effect of judging the similarity of hit count changes over time, it becomes possible, for example, to determine whether information on a particular topic is propagating, based on the similarity between the hit count transitions of the search conditions.
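The threshold-based grouping can be illustrated with a concrete similarity measure. Since the similarity expression of the third embodiment is not reproduced in this text, the sketch below substitutes a cosine similarity between a(t) and b(t + k) as a stand-in; this is an assumption for illustration only and should not be read as the patent's formula. The name `lagged_similarity`, the threshold of 0.9, and the yearly counts are likewise hypothetical.

```python
import math


def lagged_similarity(a, b, lag=0):
    """Cosine similarity between a(t) and b(t + lag) over the overlapping range.

    Stand-in measure (assumption); `lag` plays the role of the time-shift
    parameter k in the description.
    """
    pairs = [(a[t], b[t + lag]) for t in range(len(a)) if 0 <= t + lag < len(b)]
    dot = sum(x * y for x, y in pairs)
    norm_a = math.sqrt(sum(x * x for x, _ in pairs))
    norm_b = math.sqrt(sum(y * y for _, y in pairs))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # no overlap or an all-zero series: treat as dissimilar
    return dot / (norm_a * norm_b)


# Hypothetical hit counts per time range for two search conditions.
product_a_tokyo = [10, 20, 40, 30]
product_a_osaka = [12, 22, 38, 33]
THRESHOLD = 0.9  # hypothetical similarity threshold
same_cluster = lagged_similarity(product_a_tokyo, product_a_osaka) > THRESHOLD
```

With a nonzero `lag`, two series that show the same rise and fall at different times can still be judged similar, which relates to the information-propagation use mentioned above.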

The present invention is not limited to the embodiments described above and includes various modifications. For example, the embodiments above are described in detail in order to explain the present invention clearly, and the invention is not necessarily limited to configurations that include every element described. Part of the configuration of one embodiment may be replaced with the configuration of another embodiment, and the configuration of another embodiment may be added to the configuration of one embodiment. For part of the configuration of each embodiment, other configurations may be added, deleted, or substituted.

Each of the configurations, functions, processing units, processing means, and the like described above may be realized partly or wholly in hardware, for example by designing them as integrated circuits. Each of the configurations, functions, and the like may also be realized in software, with a processor interpreting and executing programs that implement the respective functions. Information such as the programs, tables, and files that implement each function can be stored in a memory, in a storage device such as a hard disk or SSD, or on a recording medium such as an IC card, SD card, or DVD.

The control lines and information lines shown are those considered necessary for the explanation; not all control lines and information lines of an actual product are necessarily shown. In practice, almost all of the components may be considered to be connected to one another.

DESCRIPTION OF SYMBOLS
1 data collection system, 5 communication network, 10 data collection device, 20 server device, 100 information processing apparatus, 401 data acquisition unit, 402 collected data, 403 set search words, 404 search condition generation unit, 405 synonym dictionary, 406 antonym dictionary, 407 data search unit, 408 hit count data, 409 cluster generation unit, 410 exclusion condition determination unit, 411 search condition data, 412 exclusion condition selection unit, S1100 search condition generation process, S1200 collected data acquisition process, S1300 search process, S1400 cluster generation process, S1500 exclusion condition determination process, 1600 exclusion condition setting screen

Claims

A data collection device that is an information processing apparatus for collecting information satisfying a search condition from a data group, wherein the data collection device
searches the data group using each of a plurality of search conditions, and
identifies candidates for the search condition to be used for the search based on a result of classifying the plurality of search conditions into a plurality of clusters according to the number of hits of each of the plurality of search conditions.

The data collection device according to claim 1,
wherein the data collection device obtains an average value of the number of hits of the search conditions belonging to each of the clusters, and identifies the search condition candidates based on the average value.

The data collection device according to claim 2,
wherein the data collection device excludes, from the candidates, the search conditions belonging to a cluster whose average value is not included in a preset range.

The data collection device according to any one of claims 1 to 3,
wherein the data collection device automatically generates a search condition different from the search condition by looking up, in a related word dictionary, a related word of a word included in the search condition and substituting the related word for that word.

The data collection device according to claim 4,
wherein the related word dictionary is at least one of a synonym dictionary, an antonym dictionary, and an instantiation dictionary.

The data collection device according to claim 1,
wherein the data collection device classifies the plurality of search conditions into clusters by a k-means method.

The data collection device according to claim 1,
further comprising a user interface for presenting the search conditions identified as candidates for the search condition to be used for the search and for accepting a selection from among them.

The data collection device according to claim 1, wherein the data collection device
totals the number of hits for each acquisition source of the data in the data group,
obtains a total value by weighting the number of hits of each acquisition source and summing the weighted numbers, and
identifies candidates for the search condition to be used for the search based on a result of classifying the plurality of search conditions into a plurality of clusters according to the total value of each of the plurality of search conditions.

The data collection device according to claim 1,
wherein the data collection device classifies the plurality of search conditions into clusters according to the similarity of time-series changes in the number of hits of each of the plurality of search conditions.

The data collection device according to claim 1,
wherein the data collection device is communicably connected to the Internet and accesses the data group via the Internet.

A data collection method for collecting information satisfying a search condition from a data group, wherein an information processing apparatus executes:
a first step of searching the data group using each of a plurality of search conditions; and
a second step of identifying candidates for the search condition to be used for the search based on a result of classifying the plurality of search conditions into a plurality of clusters according to the number of hits of each of the plurality of search conditions.

The data collection method according to claim 11,
wherein, in the second step, the information processing apparatus obtains an average value of the number of hits of the search conditions belonging to each of the clusters, and identifies the search condition candidates based on the average value.

The data collection method according to claim 12,
wherein, in the second step, the information processing apparatus excludes, from the candidates, the search conditions belonging to a cluster whose average value is not included in a preset range.

The data collection method according to claim 11,
wherein the information processing apparatus further executes:
a step of totaling the number of hits for each acquisition source of the data in the data group; and
a step of obtaining a total value by weighting the number of hits of each acquisition source and summing the weighted numbers,
and wherein, in the second step, the information processing apparatus identifies candidates for the search condition to be used for the search based on a result of classifying the plurality of search conditions into a plurality of clusters according to the total value of each of the plurality of search conditions.

The data collection method according to claim 11,
wherein, in the second step, the information processing apparatus classifies the plurality of search conditions into clusters according to the similarity of time-series changes in the number of hits of each of the plurality of search conditions.