JP2015026103A - Data gathering apparatus, data gathering method, and program

Info

Publication number: JP2015026103A
Application number: JP2013153441A
Authority: JP
Inventors: Akihiro Kobayashi (小林 亮博); Keiichiro Hoashi (帆足 啓一郎)
Original assignee: KDDI Corp
Current assignee: KDDI Corp
Priority date: 2013-07-24
Filing date: 2013-07-24
Publication date: 2015-02-05
Anticipated expiration: 2033-07-24
Also published as: JP6103228B2

Abstract

PROBLEM TO BE SOLVED: To efficiently gather data on a network while securing both the real-time property and the integrity of the data. SOLUTION: A data gathering apparatus 100 includes: a search condition storage unit 110 which stores a search condition; a long-period search unit 120 which executes a full search of a database (DB) server on a network on the basis of the search condition; a short-period search unit 130 which searches the DB server at high frequency, giving priority to the latest data, on the basis of the search condition; a search incidental information storage unit 141 which stores each search result in association with the search unit that acquired it and the time at which it was acquired; a search result storage unit 142 which stores the search results without redundancy; a short-period adjustment unit 150 which decides the search period of the short-period search unit 130 on the basis of the data in the search incidental information storage unit 141; and a long-period adjustment unit 160 which adjusts the search period of the long-period search unit 120 on the basis of the data in the search incidental information storage unit 141.

Description

The present invention relates to a data collection device, a data collection method, and a program that efficiently collect data on a network.

When data from an SNS (social networking service), a bulletin board, or a similar service on a network is to be used for marketing or the like, the data must first be stored in the system, and searches and other operations are then performed on the stored data. This is because information on a network is not guaranteed to persist: on SNSs, bulletin boards, and the like, old data is eventually deleted.

The first method that comes to mind for collecting and storing data from an SNS, a bulletin board, or the like is to collect all of its data and store it in a search database (DB). For example, to meet this demand, services such as Twitter (Non-Patent Document 1) provide a Stream API through which the server delivers data in real time to the collecting system. Furthermore, if the search DB of the collecting system can be linked at the database level with the server of the SNS or bulletin board, all data can be copied at high speed using existing replication technology such as that described in Non-Patent Document 2.

The Stream API described above can deliver complete data to the collecting system in real time by sending data immediately whenever the SNS or bulletin board is updated, that is, whenever a user posts something new. However, for commercial and performance reasons, many services do not permit copying of all their data. In addition, a system that collects and stores data from SNSs, bulletin boards, and the like would end up holding a large amount of data that its users never use, which incurs enormous cost.

For this reason, methods have been proposed that collect only the data a user needs by specifying keywords, as in Patent Document 1 and Patent Document 2. Patent Document 1 proposes a technique that collects data by following the links of pages containing the relevant keyword. Patent Document 2 proposes a technique that efficiently collects only Japanese-language data by focusing on particles.

Patent Document 1: JP 2005-346598 A
Patent Document 2: JP 2012-032903 A

Non-Patent Document 1: Twitter, [retrieved May 15, 2013], Internet <URL: http://twitter.com/>
Non-Patent Document 2: Ken Mishima, Hiroshi Nakamura, "Concurrent control method for eager replication middleware", IEICE Transactions D, Vol. J93-D, No. 3, pp. 232-240, 2010

However, the methods of Patent Documents 1 and 2 take time to search all data so that nothing is left unacquired, while on SNSs, bulletin boards, and the like new data is generated continuously as users post. Data generated after a search has started therefore cannot be acquired. As a result, the methods of Patent Documents 1 and 2 cannot achieve both the real-time property and the completeness of data.

In view of the above problem, an object of the present invention is to provide a data collection device and a data collection method that efficiently collect data on a network while achieving both the real-time property and the completeness of data.

To solve the above problem, the present invention proposes the following. For ease of understanding, the description uses reference numerals corresponding to the embodiments of the present invention, but the invention is not limited to them.

(1) The present invention proposes a data collection device that collects data on a network, comprising: search condition storage means (for example, the search condition storage unit 110 in FIG. 1) that stores a search condition set by a user; long-cycle search means (for example, the long-cycle search unit 120 in FIG. 1) that executes a full search of a data server on the network based on the search condition; short-cycle search means (for example, the short-cycle search unit 130 in FIG. 1) that executes, more frequently than the long-cycle search means, a search of the data server on the network that prioritizes the latest data, based on the search condition; search accompanying information storage means (for example, the search accompanying information storage unit 141 in FIG. 1) that stores each search result acquired by the long-cycle search means and the short-cycle search means in association with the search means that acquired it and the time at which it was acquired; search result storage means (for example, the search result storage unit 142 in FIG. 1) that stores the search results acquired by the long-cycle search means and the short-cycle search means with duplicates eliminated; short-cycle adjustment means (for example, the short-cycle adjustment unit 150 in FIG. 1) that refers to the search accompanying information storage means and determines the cycle of the search performed by the short-cycle search means based on the average increase per unit time of data matching the search condition in the database server, obtained as the number of search results acquired per unit time by the long-cycle search means, and on the number of search results acquired per unit time by the short-cycle search means; and long-cycle adjustment means (for example, the long-cycle adjustment unit 160 in FIG. 1) that refers to the search accompanying information storage means and adjusts the cycle of the search executed by the long-cycle search means based on the degree of overlap of the search results acquired by the long-cycle search means with the search results acquired by the short-cycle search means.

(2) The present invention proposes the data collection device of (1), wherein the long-cycle adjustment means refers to the search accompanying information storage means and adjusts the cycle of the search executed by the long-cycle search means based on the degree of overlap between the search results acquired by consecutive searches of the long-cycle search means.

(3) The present invention proposes the data collection device of (1) or (2), wherein the short-cycle adjustment means refers to the search accompanying information storage means and adjusts the cycle of the search executed by the short-cycle search means based on the degree of overlap of the search results acquired by the short-cycle search means with the search results acquired by the long-cycle search means.

(4) The present invention proposes the data collection device of any of (1) to (3), wherein the short-cycle adjustment means refers to the search accompanying information storage means and adjusts the cycle of the search executed by the short-cycle search means based on the degree of overlap between the search results acquired by consecutive searches of the short-cycle search means.

(5) The present invention proposes the data collection device of any of (1) to (4), further comprising: resource allocation determination means (for example, the resource allocation determination unit 180 in FIG. 3) that, when the search condition storage means stores a plurality of search conditions, refers to the search accompanying information storage means and determines the proportion of resources to allocate to each search condition based on the acquisition rate and acquisition amount of the search results for each search condition stored in the search condition storage means; and resource control means (for example, the resource control unit 190 in FIG. 3) that allocates resources, based on those proportions, to the processes that execute the search for each search condition in the short-cycle search means and the long-cycle search means.

(6) The present invention proposes the data collection device of any of (1) to (5), wherein the short-cycle search means searches the data accumulated in the database server after the long-cycle search means started its search.

(7) The present invention proposes the data collection device of any of (1) to (6), further comprising on-demand search means (for example, the on-demand search unit 170 in FIG. 1) that, in response to receiving a search condition from the user, searches the data server on the network based on the received search condition, wherein the search result storage means stores the search results acquired by the on-demand search means with duplicates eliminated.

(8) The present invention proposes a data collection method in a data collection device comprising search condition storage means that stores a search condition set by a user, long-cycle search means, short-cycle search means, search accompanying information storage means, search result storage means, short-cycle adjustment means, and long-cycle adjustment means, the method comprising: a first step (for example, step S1 in FIG. 2) in which the long-cycle search means executes a full search of a database server on a network based on the search condition; a second step (for example, step S2 in FIG. 2) in which the short-cycle search means executes, more frequently than the long-cycle search means, a search of the database server on the network that prioritizes the latest data, based on the search condition; a third step (for example, step S3 in FIG. 2) in which the search accompanying information storage means stores each search result acquired in the first step and the second step in association with the search means that acquired it and the time at which it was acquired; a fourth step (for example, step S4 in FIG. 2) in which the search result storage means stores the search results acquired in the first step and the second step with duplicates eliminated; a fifth step (for example, step S6 in FIG. 2) in which the short-cycle adjustment means refers to the search accompanying information storage means and determines the cycle of the search performed by the short-cycle search means based on the average increase per unit time of data matching the search condition in the database server, obtained as the number of search results acquired per unit time by the long-cycle search means, and on the number of search results acquired per unit time by the short-cycle search means; and a sixth step (for example, step S7 in FIG. 2) in which the long-cycle adjustment means refers to the search accompanying information storage means and adjusts the cycle of the search executed by the long-cycle search means based on the degree of overlap of the search results acquired by the long-cycle search means with the search results acquired by the short-cycle search means.

(9) The present invention proposes a program for causing a computer to execute a data collection method in a data collection device comprising search condition storage means that stores a search condition set by a user, long-cycle search means, short-cycle search means, search accompanying information storage means, search result storage means, short-cycle adjustment means, and long-cycle adjustment means, the program causing the computer to execute: a first step (for example, step S1 in FIG. 2) in which the long-cycle search means executes a full search of a database server on a network based on the search condition; a second step (for example, step S2 in FIG. 2) in which the short-cycle search means executes, more frequently than the long-cycle search means, a search of the database server on the network that prioritizes the latest data, based on the search condition; a third step (for example, step S3 in FIG. 2) in which the search accompanying information storage means stores each search result acquired in the first step and the second step in association with the search means that acquired it and the time at which it was acquired; a fourth step (for example, step S4 in FIG. 2) in which the search result storage means stores the search results acquired in the first step and the second step with duplicates eliminated; a fifth step (for example, step S6 in FIG. 2) in which the short-cycle adjustment means refers to the search accompanying information storage means and determines the cycle of the search performed by the short-cycle search means based on the average increase per unit time of data matching the search condition in the database server, obtained as the number of search results acquired per unit time by the long-cycle search means, and on the number of search results acquired per unit time by the short-cycle search means; and a sixth step (for example, step S7 in FIG. 2) in which the long-cycle adjustment means refers to the search accompanying information storage means and adjusts the cycle of the search executed by the long-cycle search means based on the degree of overlap of the search results acquired by the long-cycle search means with the search results acquired by the short-cycle search means.

According to the present invention, data on a network can be collected efficiently while achieving both the real-time property and the completeness of the data.

FIG. 1 is a diagram showing the configuration of the data collection device according to the first embodiment of the present invention.
FIG. 2 is a flowchart of the data collection process in the data collection device according to the first embodiment of the present invention.
FIG. 3 is a diagram showing the configuration of the data collection device according to the second embodiment of the present invention.
FIG. 4 is a diagram showing an example of the data sets for words A, B, and C in the database server.
FIG. 5 is a diagram showing the configuration of the data collection device of the present invention applied to Twitter.

Embodiments of the present invention are described in detail below with reference to the drawings. The constituent elements in the embodiments can be replaced with existing constituent elements as appropriate, and many variations are possible, including combinations with other existing constituent elements. The description of the embodiments therefore does not limit the content of the invention set out in the claims.

<First Embodiment>
<Configuration of data collection device>
FIG. 1 is a diagram showing the configuration of the data collection device 100 according to the first embodiment of the present invention. The data collection device 100 efficiently collects data from a database server on the network, for example a database server that stores the data of an SNS, a bulletin board, or the like. As shown in FIG. 1, the data collection device 100 comprises a search condition storage unit 110, a long-cycle search unit 120, a short-cycle search unit 130, a data storage unit 140, a short-cycle adjustment unit 150, a long-cycle adjustment unit 160, and an on-demand search unit 170.

The search condition storage unit 110 stores the search conditions for the data that the user wants to collect from the database server on the network.

The long-cycle search unit 120 performs a full search of the database server on the network, based on the search conditions stored in the search condition storage unit 110, at a cycle longer than the search cycle of the short-cycle search unit 130. When the number of data items to search or the period to search has been set in advance, the long-cycle search unit 120 performs the full search within that range. The long-cycle search unit 120 initially searches at a search cycle set by the user or the like; once the long-cycle adjustment unit 160 has adjusted the search cycle, it searches at the adjusted cycle.

In the present embodiment, the long-cycle search unit 120 searches using one of two methods. The first is a link-following method, which obtains search results by repeatedly following the links in a web page and collecting other web pages. The second is a keyword search method, which searches web pages by keyword using the HTTP GET command and obtains the pages matching the keyword as search results. For the keyword search method, the API provided by Twitter (registered trademark), for example, can be used.
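The two search methods can be contrasted with a minimal sketch. Everything here is an assumption made for illustration: `PAGES` is a hypothetical in-memory stand-in for web pages, and neither function issues real HTTP requests or calls any real API.

```python
from collections import deque

# Hypothetical in-memory stand-in for web pages: url -> (text, outgoing links).
PAGES = {
    "/a": ("kddi news", ["/b", "/c"]),
    "/b": ("weather", ["/c"]),
    "/c": ("kddi event", ["/a"]),
    "/d": ("kddi unlinked", []),
}

def link_following_search(start, keyword):
    """Breadth-first traversal from `start`, collecting pages that contain `keyword`."""
    seen, queue, hits = {start}, deque([start]), []
    while queue:
        url = queue.popleft()
        text, links = PAGES[url]
        if keyword in text:
            hits.append(url)
        for nxt in links:
            if nxt not in seen:  # visit each page only once
                seen.add(nxt)
                queue.append(nxt)
    return hits

def keyword_search(keyword):
    """Stand-in for a keyword query answered by the server itself."""
    return sorted(url for url, (text, _) in PAGES.items() if keyword in text)
```

In this toy data the link-following method cannot reach `/d`, because no page links to it, whereas the keyword method returns it. The sketch is only meant to show how the two methods differ in what they can reach.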

The short-cycle search unit 130 searches the database server on the network at high frequency, at a short cycle on the order of seconds, based on the search conditions stored in the search condition storage unit 110 and giving priority to the latest data. Like the long-cycle search unit 120, the short-cycle search unit 130 searches using either the link-following method or the keyword search method. Also like the long-cycle search unit 120, the short-cycle search unit 130 initially searches at a search cycle set by the user or the like; once the short-cycle adjustment unit 150 has adjusted the search cycle, it searches at the adjusted cycle.

The short-cycle search unit 130 may search all data in the database server, or it may search only the data accumulated in the database server after the long-cycle search unit 120 started its search. Restricting the short-cycle search to data accumulated after the long-cycle search started allows the searches by the long-cycle search unit 120 and the short-cycle search unit 130 to run efficiently when the resources of the data collection device 100 are limited.
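The choice between searching everything and searching only data newer than the start of the long-cycle search can be sketched as an optional lower bound on the creation time. The `Post` type, the `since` and `limit` parameters, and the in-memory list of posts are all hypothetical; a real collector would pass the equivalent bound to the server's search interface.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Post:
    post_id: int
    created_at: float  # creation time in seconds (hypothetical clock)
    text: str

def short_cycle_search(posts, keyword, since=None, limit=100):
    """Return matching posts, newest first.

    If `since` is given (e.g. the start time of the long-cycle search),
    only posts created at or after that time are considered.
    """
    matches = [
        p for p in posts
        if keyword in p.text and (since is None or p.created_at >= since)
    ]
    matches.sort(key=lambda p: p.created_at, reverse=True)  # latest data first
    return matches[:limit]
```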

The data storage unit 140 includes a search accompanying information storage unit 141 and a search result storage unit 142.

The search accompanying information storage unit 141 stores the information that the long-cycle adjustment unit 160 and the short-cycle adjustment unit 150 use when adjusting the cycles. Specifically, it stores each search result acquired by the long-cycle search unit 120 and the short-cycle search unit 130 in association with information identifying the search unit that acquired it and the time at which it was acquired. The data in the search accompanying information storage unit 141 is deleted after the long-cycle adjustment unit 160 and the short-cycle adjustment unit 150 have adjusted the cycles, at preset intervals, or when its capacity is exceeded.
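One possible shape for these records is sketched below. The class and field names are hypothetical, and the policy of discarding all records once capacity is reached is an assumption; the text only says that data is deleted when capacity is exceeded.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class AcquisitionRecord:
    result_id: str      # identifier of the search result (hypothetical)
    searcher: str       # which unit acquired it: "long" or "short"
    acquired_at: float  # acquisition time in seconds

@dataclass
class AccompanyingInfoStore:
    capacity: int = 10_000
    records: list = field(default_factory=list)

    def record(self, result_id, searcher, acquired_at):
        # Assumed policy: once capacity is reached, drop everything held so far.
        if len(self.records) >= self.capacity:
            self.records.clear()
        self.records.append(AcquisitionRecord(result_id, searcher, acquired_at))

    def ids_acquired_by(self, searcher):
        """Set of result IDs acquired by the given search unit."""
        return {r.result_id for r in self.records if r.searcher == searcher}

    def clear(self):
        # Intended to be called after cycle adjustment or at a preset interval.
        self.records.clear()
```

Keeping the acquiring unit alongside each result is what lets the adjustment units compare what the long-cycle and short-cycle searches each found.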

The search result storage unit 142 stores the search results acquired by the long-cycle search unit 120 and the short-cycle search unit 130 with duplicates eliminated. The search result storage unit 142 uses, for example, an in-memory DB with fast input and output. Because the data stored in the search result storage unit 142 combines the real-time property with completeness, it can be used effectively for marketing and business planning.
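Duplicate elimination can be sketched with a dictionary keyed by a result identifier. This assumes each search result carries a unique ID, which the text does not state; the class below is a plain in-process stand-in, not an in-memory DB product.

```python
class SearchResultStore:
    """Keeps one copy of each search result, keyed by a hypothetical unique ID."""

    def __init__(self):
        self._by_id = {}

    def add_all(self, results):
        """Store (result_id, body) pairs and return how many were new."""
        added = 0
        for result_id, body in results:
            if result_id not in self._by_id:  # skip results already stored
                self._by_id[result_id] = body
                added += 1
        return added

    def __len__(self):
        return len(self._by_id)
```

Results from the long-cycle and short-cycle searches can then be passed to the same `add_all` call, and anything both searches found is stored once.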

The short-cycle adjustment unit 150 refers to the search results acquired by the short-cycle search unit 130 that are stored in the search accompanying information storage unit 141, and calculates the frequency of occurrence of the search condition per unit time [times/s] and the average number of search results acquired per search [items/time]. It then multiplies the frequency of occurrence per unit time [times/s] by the average number of search results per search [items/time] to obtain the number of search results acquired per unit time by the short-cycle search unit 130 (hereinafter, the short-cycle acquisition count) [items/s].

The short-cycle adjustment unit 150 also refers to the search results acquired by the long-cycle search unit 120 that are stored in the search accompanying information storage unit 141, and calculates the frequency of occurrence of the search condition per unit time [times/s] and the average number of search results acquired per search [items/time]. It then multiplies the frequency of occurrence per unit time [times/s] by the average number of search results per search [items/time] to obtain the number of search results acquired per unit time by the long-cycle search unit 120 [items/s]. The number of search results acquired per unit time by the long-cycle search unit 120, calculated in this way, corresponds to the average increase per unit time of data matching the search condition in the database server (hereinafter, the average increase count).

The short-cycle adjustment unit 150 then adjusts the cycle of the short-cycle search unit 130 so that the short-cycle acquisition count equals the average increase count. Specifically, when the short-cycle acquisition count is larger than the average increase count, it shortens the search cycle; when the short-cycle acquisition count is smaller than the average increase count, it lengthens the search cycle. Determining the cycle of the short-cycle search unit 130 from the number of acquired search results in this way allows it to be set to an optimal cycle.
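The rate calculation and the adjustment rule can be sketched as follows. The per-second count is computed as the product of searches per second and mean results per search, which is what the unit [items/s] requires. The direction of adjustment follows the rule as stated in the text. The function names, the 10% step, and the minimum period are assumptions introduced for illustration only.

```python
def acquisition_rate(result_counts, window_seconds):
    """Results acquired per second over a window.

    `result_counts` holds the number of results returned by each search
    executed during the window.
    """
    if not result_counts or window_seconds <= 0:
        return 0.0
    searches_per_second = len(result_counts) / window_seconds        # [times/s]
    mean_results_per_search = sum(result_counts) / len(result_counts)  # [items/time]
    return searches_per_second * mean_results_per_search              # [items/s]

def adjust_short_period(period, short_rate, average_increase,
                        step=0.1, min_period=1.0):
    """Move the short-cycle period in the direction stated in the text."""
    if short_rate > average_increase:
        return max(min_period, period * (1 - step))  # shorten the cycle
    if short_rate < average_increase:
        return period * (1 + step)                   # lengthen the cycle
    return period
```

For example, three searches in a 60-second window returning 10, 20, and 30 results give 0.05 searches per second and 20 results per search, so about 1 result per second.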

The short-cycle adjustment unit 150 also refers to the search accompanying information storage unit 141 and adjusts the search cycle of the short-cycle search unit 130 based on the degree of overlap of the search results acquired by the short-cycle search unit 130 with the search results acquired by the long-cycle search unit 120. Specifically, when there are search results acquired by the long-cycle search unit 120 that do not overlap with those acquired by the short-cycle search unit 130, that is, results that the long-cycle search unit 120 acquired but the short-cycle search unit 130 failed to acquire, it shortens the search cycle of the short-cycle search unit 130. This adjusts the search cycle so that no search results are missed.
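This check amounts to a set difference between the result IDs of the two search units. The sketch below assumes both ID sets have already been restricted to the same time window, so that older data outside the short-cycle search's scope is not counted as missed. The halving factor and minimum period are hypothetical.

```python
def missed_by_short(long_ids, short_ids):
    """Results the long-cycle search acquired but the short-cycle search did not."""
    return set(long_ids) - set(short_ids)

def shorten_if_missed(period, long_ids, short_ids, factor=0.5, min_period=1.0):
    """Shorten the short-cycle period when the short-cycle search missed results."""
    if missed_by_short(long_ids, short_ids):
        return max(min_period, period * factor)
    return period
```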

Furthermore, the short-cycle adjustment unit 150 refers to the search accompanying information storage unit 141 and adjusts the search cycle of the short-cycle search unit 130 based on the degree of overlap between the search results acquired by consecutive searches of the short-cycle search unit 130. Specifically, when the results acquired by one search of the short-cycle search unit 130 overlap with the results acquired by the next search, it lengthens the search cycle of the short-cycle search unit 130. The greater the overlap between the results of the earlier search and those of the later search, the longer the search cycle is made. This adjusts the search cycle to an appropriate value that uses resources efficiently.
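A minimal sketch of lengthening the cycle in proportion to the overlap between consecutive searches follows. Measuring overlap as the fraction of the later search's results that already appeared in the earlier one, and scaling linearly up to `max_factor`, are both assumptions; the text only requires that more overlap gives a longer cycle.

```python
def overlap_ratio(previous_ids, current_ids):
    """Fraction of the current search's results already seen in the previous search."""
    current = set(current_ids)
    if not current:
        return 0.0
    return len(current & set(previous_ids)) / len(current)

def lengthen_by_overlap(period, previous_ids, current_ids, max_factor=2.0):
    """Scale the period between 1x (no overlap) and max_factor (full overlap)."""
    ratio = overlap_ratio(previous_ids, current_ids)
    return period * (1 + (max_factor - 1) * ratio)
```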

By having the short cycle adjustment unit 150 adjust the search cycle of the short cycle search unit 130 in this way, the search cycle can follow fluctuations in the data volume of the database server and in the frequency with which the search condition appears, both of which vary with the day of the week, the time of day, and the content.
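As an illustrative sketch only, the two short cycle adjustment rules described above might be expressed as follows in Python. The record layout (`SearchRecord`), the halving step, the proportional lengthening factor, and the lower and upper bounds are all assumptions introduced for illustration; the description states only the direction of each adjustment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SearchRecord:
    """One row of search accompanying information (hypothetical layout)."""
    item_id: str  # identifier of the retrieved item
    source: str   # "long" or "short": the search unit that retrieved it
    run: int      # sequence number of the search run within that unit


def adjust_short_cycle(records, cycle_s, step=0.5, min_s=1.0, max_s=60.0):
    """Return a new short cycle period in seconds (assumed update rule)."""
    long_ids = {r.item_id for r in records if r.source == "long"}
    short = [r for r in records if r.source == "short"]
    short_ids = {r.item_id for r in short}

    # Rule 1: the full search found items the frequent search missed,
    # so the short cycle is shortened.
    if long_ids - short_ids:
        return max(min_s, cycle_s * step)

    # Rule 2: consecutive short cycle runs returned overlapping results,
    # so the cycle is lengthened, more strongly for a larger overlap.
    runs = sorted({r.run for r in short})
    if len(runs) >= 2:
        prev = {r.item_id for r in short if r.run == runs[-2]}
        last = {r.item_id for r in short if r.run == runs[-1]}
        if last:
            overlap = len(prev & last) / len(last)
            return min(max_s, cycle_s * (1.0 + overlap))
    return cycle_s
```

In this sketch the missed-result rule takes precedence over the overlap rule, on the assumption that avoiding missed results matters more than saving resources; the description does not state an order.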

The long cycle adjustment unit 160 refers to the search accompanying information storage unit 141 and adjusts the search cycle of the long cycle search unit 120 based on the degree of overlap between the search results acquired by the long cycle search unit 120 and those acquired by the short cycle search unit 130. Specifically, when there are search results acquired by the short cycle search unit 130 that do not overlap with those acquired by the long cycle search unit 120, that is, results that the short cycle search unit 130 has acquired but the long cycle search unit 120 has failed to acquire, the search cycle of the long cycle search unit 120 is shortened. Data that the long cycle search unit 120 could not acquire because it had been deleted from the database server over time can thereby be discovered, and the completeness of the data can be maintained.

The long cycle adjustment unit 160 also refers to the search accompanying information storage unit 141 and adjusts the search cycle of the long cycle search unit 120 based on the degree of overlap between the search results that the long cycle search unit 120 acquired in consecutive searches. Specifically, when the search results acquired in the later search include all of the search results acquired in the earlier search, the search cycle is lengthened. The search cycle can thereby be adjusted to an appropriate value that uses resources efficiently.
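The long cycle adjustment can be sketched in the same spirit. The function below is a minimal illustration, assuming that results are identified by hashable IDs and that the cycle is halved or doubled; the factor of 2 and the bounds are hypothetical values, not taken from this description.

```python
def adjust_long_cycle(short_ids, prev_long_ids, last_long_ids,
                      cycle_s, factor=2.0, min_s=60.0, max_s=86400.0):
    """Return a new long cycle period in seconds (assumed update rule)."""
    short_ids = set(short_ids)
    prev_long_ids = set(prev_long_ids)
    last_long_ids = set(last_long_ids)

    # The short cycle search holds items that the latest full search did not
    # return (for example, items deleted before the full search ran):
    # shorten the long cycle.
    if short_ids - last_long_ids:
        return max(min_s, cycle_s / factor)

    # The latest full search still contains everything the previous one did:
    # nothing was lost in between, so the long cycle can be lengthened.
    if prev_long_ids <= last_long_ids:
        return min(max_s, cycle_s * factor)

    return cycle_s
```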

Note that the short cycle adjustment unit 150 and the long cycle adjustment unit 160 perform the search cycle adjustment at a preset day of the week, date and time, or interval.

In response to receiving a search condition input by the user, the on-demand search unit 170 searches the database server on the network based on the received search condition. The on-demand search unit 170 then stores the search results in the search result storage unit 142. A search can thereby be performed even for search conditions that are not stored in the search condition storage unit 110. The on-demand search unit 170 need not be included in the data collection device 100.

The flowchart of the data collection process in the data collection device 100 according to the first embodiment of the present invention will be described with reference to FIG. 2. Steps S1 and S2 are independent of each other, and each starts when a search start instruction is received.

First, in step S1, the long cycle search unit 120 performs a full search of the data server on the network based on the search condition stored in the search condition storage unit 110, at a cycle longer than the search cycle of the short cycle search unit 130.

Next, in step S2, the short cycle search unit 130 searches the data server on the network at high frequency, with a short cycle on the order of seconds, based on the search condition stored in the search condition storage unit 110, giving priority to the latest data.

Next, in step S3, the search accompanying information storage unit 141 stores the search results acquired by the long cycle search unit 120 and the short cycle search unit 130 in association with information identifying the search unit that acquired each result and with the time at which the result was acquired.

Next, in step S4, the search result storage unit 142 stores the search results acquired by the long cycle search unit 120 and the short cycle search unit 130 with duplicates eliminated. Steps S3 and S4 may be performed in the reverse order or simultaneously.
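The two storage steps S3 and S4 can be illustrated with a small Python sketch. The class name `DataStore`, the tuple layout, and the use of an item ID as the deduplication key are assumptions made for illustration; the description does not specify how duplicates are identified.

```python
import time


class DataStore:
    """Toy stand-in for the data storage unit (hypothetical structure)."""

    def __init__(self):
        # S3: every acquisition is logged with its source and time.
        self.accompanying = []  # list of (item_id, source, acquired_at)
        # S4: each item is kept only once, keyed by its ID.
        self.results = {}       # item_id -> payload

    def store(self, items, source, acquired_at=None):
        """Store (item_id, payload) pairs; return how many were new."""
        ts = time.time() if acquired_at is None else acquired_at
        added = 0
        for item_id, payload in items:
            self.accompanying.append((item_id, source, ts))
            if item_id not in self.results:
                self.results[item_id] = payload
                added += 1
        return added
```

Keeping the full acquisition log separate from the deduplicated results is what lets the adjustment units later compare what each search unit retrieved and when.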

Next, in step S5, the short cycle adjustment unit 150 and the long cycle adjustment unit 160 determine whether or not to adjust the search cycles. If they determine that the search cycles are to be adjusted (YES), the process proceeds to step S6; if they determine that the search cycles are not to be adjusted (NO), the process proceeds to step S8.

Next, in step S6, the short cycle adjustment unit 150 refers to the search results stored in the search accompanying information storage unit 141 and adjusts the search cycle of the short cycle search unit 130.

Next, in step S7, the long cycle adjustment unit 160 refers to the search results stored in the search accompanying information storage unit 141 and adjusts the search cycle of the long cycle search unit 120. Steps S6 and S7 may be performed in the reverse order or simultaneously.

Next, in step S8, the long cycle search unit 120 and the short cycle search unit 130 determine whether or not an end instruction has been given. If they determine that an end instruction has been given (YES), the process ends; if they determine that no end instruction has been given (NO), the process returns to steps S1 and S2.
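The control flow of steps S1 to S8 can be summarized in a simplified, single-threaded Python sketch. All callables passed in are hypothetical placeholders, and running S1 and S2 one after the other in the same loop is a simplification: in the description the two searches are independent and run at different cycles.

```python
def run_collection(long_search, short_search, store,
                   should_adjust, adjust, should_stop):
    """Run the S1-S8 loop until an end instruction; return the round count."""
    rounds = 0
    while True:
        long_hits = long_search()     # S1: full search
        short_hits = short_search()   # S2: frequent search for latest data
        store(long_hits, "long")      # S3 and S4: log and deduplicate
        store(short_hits, "short")
        if should_adjust():           # S5: adjust the cycles this round?
            adjust()                  # S6 and S7: short and long cycle
        rounds += 1
        if should_stop():             # S8: end instruction received?
            return rounds
```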

As described above, according to this embodiment, the full search by the long cycle search unit 120 is combined with a high-frequency, short-cycle search for the latest data by the short cycle search unit 130. Data accumulated in the database server after the long cycle search unit 120 started its search, which could not be acquired conventionally, can thereby be acquired. As a result, data on the network can be collected efficiently while achieving both real-time performance and completeness of the data. In addition, adjusting the search cycles of the long cycle search unit 120 and the short cycle search unit 130 based on the collected search results further improves both real-time performance and completeness while keeping the collection efficient.

<Second Embodiment>
A second embodiment of the present invention will be described with reference to FIG. 3. When there are a plurality of search conditions, the data collection device of this embodiment controls the resources allocated to each search condition and efficiently collects data from a database server on the network, for example a database server storing data of an SNS, a bulletin board, or the like. Components given the same reference numerals as in the first embodiment have the same functions, and their detailed description is omitted.

<Configuration of data collection device>
FIG. 3 is a diagram showing the configuration of the data collection device 101 according to the second embodiment of the present invention. As shown in FIG. 3, the data collection device 101 includes a search condition storage unit 110, a long cycle search unit 121, a short cycle search unit 131, a data storage unit 140, a long cycle adjustment unit 160, a short cycle adjustment unit 150, a resource allocation determination unit 180, and a resource control unit 190.

When a plurality of search conditions are stored in the search condition storage unit 110, the long cycle search unit 121 and the short cycle search unit 131 execute the search processes for the respective search conditions in parallel. The long cycle search unit 121 and the short cycle search unit 131 also have the functions described in the first embodiment.

When a plurality of search conditions are stored in the search condition storage unit 110, the resource allocation determination unit 180 refers to the search accompanying information storage unit 141 and determines the proportion of resources to distribute to each search condition based on the acquisition rate and the acquisition amount of the search results of each search condition. Specifically, the resource allocation determination unit 180 determines the resource allocation to each search condition by solving the N simultaneous equations given by the following expressions (1) and (2), with the N values of r as variables, such that the acquisition rate T_i of the search results is the same for every search condition.

The resource control unit 190 allocates resources to the processes that execute the search for each search condition, based on the resource allocation to each search condition determined by the resource allocation determination unit 180. Specifically, the resource control unit 190 adjusts the number of search processes and the search frequency for each search condition according to the resources allocated to it.
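A toy illustration of equalizing the acquisition rate across search conditions is given below. Expressions (1) and (2) are not reproduced in this text, so the sketch substitutes an assumed linear model in which the acquisition rate of condition i is proportional to r_i / v_i, where v_i is the observed data volume per unit time. Under that assumption, equal acquisition rates with a fixed total resource reduce to an allocation proportional to volume. The function name and the model are hypothetical and are not the patent's equations.

```python
def allocate_resources(volumes, total):
    """Split `total` resource units so that r_i / v_i is equal for all i.

    volumes: mapping of search condition -> observed data volume per unit
             time (assumed to be derivable from the accompanying information).
    """
    total_volume = sum(volumes.values())
    if total_volume == 0:
        # No data observed yet: fall back to an equal split.
        share = total / len(volumes)
        return {cond: share for cond in volumes}
    return {cond: total * v / total_volume for cond, v in volumes.items()}
```

For example, with volumes of 100, 100, and 800 for words A, B, and C and 10 resource units in total, the sketch assigns 1, 1, and 8 units, rather than the equal split that leaves the high-volume word short of resources.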

When there are a plurality of search conditions, distributing the same amount of resources to each of them can be inefficient. For example, when the search target is a database server in which words A, B, and C have data sets as shown in FIG. 4, allocating resources equally to words A, B, and C may yield sufficient data for words A and B while leaving word C short of resources, so that sufficient data cannot be obtained for it.

In this embodiment, providing the data collection device 100 with the resource allocation determination unit 180 and the resource control unit 190 described above makes it possible, when there are a plurality of search conditions, to distribute resources so that the number of search results obtained for each search condition is the same. As a result, data can be collected efficiently even when there are a plurality of search conditions.

<Example>
An example in which the data collection device 101 of this embodiment is applied to Twitter will be described. Components given the same reference numerals as in the first and second embodiments have the same functions, and their detailed description is omitted.

In this example, as shown in FIG. 5, the data collection device 200 includes a search condition storage unit 110, a search unit 210, a data storage unit 140, a Stream processing unit 220, a search engine 230, and an advanced analyzer 240. The Stream processing unit 220, the search engine 230, and the advanced analyzer 240 are processing units that analyze the data stored in the data storage unit 140 for marketing and similar purposes, and may be provided in a separate device.

The search unit 210 includes a keyword search unit 211 that performs keyword-specified searches and a user search unit 212 that performs user-specified searches. Out of all Tweets on Twitter, the Tweets used for marketing and similar purposes can be collected mainly by keyword or by user, so in this example data is collected by two search methods: keyword search and user search.

Each processing unit of the search unit 210 runs at least one process, and these processes operate in parallel, collecting Tweets from the Twitter database server using, for example, the Twitter REST API. When data is to be acquired for a plurality of keywords or users at the user's request, the number of processes of each processing unit is increased as necessary.

The keyword search unit 211 includes a keyword long cycle search unit 211a, a keyword short cycle search unit 211b, a hiragana search unit 211c, and a keyword on-demand search unit 211d.

The keyword long cycle search unit 211a performs a full search of the Twitter database server for Tweets containing the keywords stored in the search condition storage unit 110. The range searched by the keyword long cycle search unit 211a may be limited by the number of search targets or the search period. The keyword short cycle search unit 211b searches the Twitter database server at high frequency, with a short cycle on the order of seconds, based on the search conditions stored in the search condition storage unit 110, giving priority to the latest Tweets.

The hiragana search unit 211c uses the technique described in Patent Document 2 to search for Japanese Tweets using hiragana characters as keywords. When the user wants to collect Tweets relating to a keyword that is not stored in the search condition storage unit 110, the keyword on-demand search unit 211d searches the Twitter database server for Tweets based on the keyword received as input from the user.

The user search unit 212 includes an existing user search unit 212a, a new user detection unit 212b, and a user on-demand search unit 212c.

The existing user search unit 212a performs a full search of the Twitter database server for Tweets based on the user IDs stored in the search condition storage unit 110. It also performs a full search for Tweets based on the user IDs of the users detected by the new user detection unit 212b. In this case, Tweets may be searched according to user priority by the method of Patent Document 3 (Japanese Patent Laid-Open No. 2012-216168). Real-time performance on a daily basis can thus be guaranteed for the Twitter users whose Tweets the user has requested to collect.

The new user detection unit 212b detects users who posted the Tweets acquired by the keyword search unit 211 and who are not stored in the search condition storage unit 110, and acquires their user IDs. Based on the user IDs acquired by the new user detection unit 212b, the existing user search unit 212a described above searches the Twitter database server for Tweets.
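New user detection amounts to filtering the authors of keyword-matched Tweets against the set of already registered user IDs. The following Python sketch shows the idea; the dictionary key `user_id` and the function name are assumptions for illustration and do not reflect the actual Twitter API response format.

```python
def detect_new_users(tweets, known_user_ids):
    """Return IDs of authors not yet registered, in first-seen order."""
    seen = set(known_user_ids)
    new_ids = []
    for tweet in tweets:
        uid = tweet["user_id"]  # hypothetical field name
        if uid not in seen:
            seen.add(uid)       # avoid reporting the same user twice
            new_ids.append(uid)
    return new_ids
```

The returned IDs would then be handed to the existing user search so that those users' Tweets are collected as well.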

The user on-demand search unit 212c searches the Twitter database server for Tweets based on the user ID received as input from the user.

The in-memory DB 142 corresponds to the search result storage unit 142 and stores the search results of the search units 211a to 211d and 212a to 212c of the search unit 210. An example of the in-memory DB 142 is Redis [retrieved May 15, 2013, Internet <URL: http://redis.io>]. The in-memory DB 142 can be parallelized.

The search units 211a to 211d and 212a to 212c operate in parallel to achieve data completeness and real-time performance, and when there are a plurality of search conditions, each of the search units 211a to 211d and 212a to 212c runs multiple parallel processes. With data such as Twitter's, even though each item is small, inserting items into a database asynchronously and in parallel greatly degrades the insertion performance of the database. Moreover, because the search results of the search units 211a to 211d and 212a to 212c contain duplicate Tweets, insertion performance is all the more likely to degrade. To resolve these problems, an in-memory DB with fast input and output is used as the search result storage unit 142. By storing the data there once, duplicate Tweets can be removed and the data can be consolidated into a single stream.
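The buffering idea can be illustrated without any particular database. In the Python sketch below, a thread-safe `queue.Queue` stands in for the in-memory DB (it is not Redis and does not use the Redis API), parallel workers push their results into it, and a single consumer removes duplicates and emits one stream. The function name and the `(tweet_id, text)` tuple layout are assumptions for illustration.

```python
import queue
import threading


def merge_to_single_stream(worker_batches):
    """Collect results from parallel workers and emit one deduplicated list."""
    buffer = queue.Queue()  # stand-in for the fast in-memory store

    def worker(batch):
        # Each search process writes its own results without coordination.
        for tweet_id, text in batch:
            buffer.put((tweet_id, text))

    threads = [threading.Thread(target=worker, args=(batch,))
               for batch in worker_batches]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # A single consumer drains the buffer, so the downstream database
    # sees one sequential stream with duplicates already removed.
    seen = set()
    stream = []
    while not buffer.empty():
        tweet_id, text = buffer.get()
        if tweet_id not in seen:
            seen.add(tweet_id)
            stream.append((tweet_id, text))
    return stream
```

The order of items in the resulting stream depends on thread scheduling; only the set of unique items is deterministic.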

The Stream processing unit 220 refers to the in-memory DB 142 and performs analysis processing that can be carried out one Tweet at a time, such as counting the number of Tweets. The Stream processing unit 220 can also be parallelized.

The search engine 230 searches the in-memory DB 142 based on instructions from the advanced analyzer 240 and passes the search results to the advanced analyzer 240. An example of the search engine 230 is Apache Solr [retrieved May 15, 2013, Internet <URL: http://lucene.apache.org/solr/>].

The advanced analyzer 240 acquires the necessary Tweets, via the search engine 230, from the in-memory DB 142 that stores the Tweets collected from the Twitter database server, and performs analysis for marketing and similar purposes. The advanced analyzer 240 implements advanced analysis processing such as that described in Non-Patent Document 5 [池田和史, 服部元, 小野智弘, 東野輝夫, "Twitter Contributor Profile Estimation Method for Market Analysis", IPSJ-CDS Vol. 2, No. 1, pp. 82-93 (2012)] and Non-Patent Document 6 [池田和史, 服部元, 小野智弘, 麻生英樹, "Proposal of an Early Detection Method for Communication Quality Degradation Trends by Twitter Analysis", FIT2012].

When the data collection device is used to collect Tweets from Twitter in this way, Tweets can be collected efficiently while achieving both real-time performance and completeness. The collected Tweets can then be used for analysis for marketing and similar purposes.

As described above, according to this embodiment, when there are a plurality of search conditions, the distribution of resources to each search condition is determined and the resources of each search condition are controlled according to the determined distribution, so that data can be collected efficiently even when there are a plurality of search conditions.

The data collection device of the present invention can be realized by recording the processing of the data collection device on a recording medium readable by a computer system, and having the data collection device read and execute the program recorded on this recording medium. The computer system referred to here includes an OS and hardware such as peripheral devices.

If a WWW (World Wide Web) system is used, the "computer system" also includes a homepage providing environment (or display environment). The program may be transmitted from a computer system that stores it in a storage device or the like to another computer system via a transmission medium, or by a transmission wave in the transmission medium. Here, the "transmission medium" that transmits the program refers to a medium having the function of transmitting information, such as a network (communication network) like the Internet or a communication line such as a telephone line.

The program may implement only part of the functions described above. It may also be a so-called difference file (difference program), which implements the functions described above in combination with a program already recorded in the computer system.

The embodiments of the present invention have been described in detail above with reference to the drawings. However, the specific configuration is not limited to these embodiments, and also includes designs and the like within a scope that does not depart from the gist of the present invention.

DESCRIPTION OF SYMBOLS
100 Data collection device
110 Search condition storage unit
120 Long cycle search unit
130 Short cycle search unit
140 Data storage unit
141 Search accompanying information storage unit
142 Search result storage unit
150 Short cycle adjustment unit
160 Long cycle adjustment unit
170 Image data management unit

Claims

A data collection device for collecting data on a network, comprising:
search condition storage means for storing search conditions set by a user;
long cycle search means for performing a full search on a database server on the network based on the search conditions;
short cycle search means for executing, based on the search conditions, a search that prioritizes the latest data at a higher frequency than the long cycle search means;
search accompanying information storage means for storing the search results acquired by the long cycle search means and the short cycle search means in association with the search means that acquired each search result and the time at which it was acquired;
search result storage means for storing the search results acquired by the long cycle search means and the short cycle search means with duplicates eliminated;
short cycle adjustment means for referring to the search accompanying information storage means and determining the search cycle executed by the short cycle search means based on the average increase per unit time of data in the database server matching the search conditions, obtained as the number of search results acquired per unit time by the long cycle search means, and on the number of search results acquired per unit time by the short cycle search means; and
long cycle adjustment means for referring to the search accompanying information storage means and adjusting the search cycle executed by the long cycle search means based on the degree of overlap of the search results acquired by the long cycle search means with respect to the search results acquired by the short cycle search means.

The data collection device according to claim 1, wherein the long cycle adjustment means refers to the search accompanying information storage means and adjusts the search cycle executed by the long cycle search means based on the degree of overlap of the search results acquired by the long cycle search means in consecutive searches.

The data collection device according to claim 1 or 2, wherein the short cycle adjustment means refers to the search accompanying information storage means and adjusts the search cycle executed by the short cycle search means based on the degree of overlap of the search results acquired by the short cycle search means with respect to the search results acquired by the long cycle search means.

The data collection device according to claim 1, wherein the short cycle adjustment means refers to the search accompanying information storage means and adjusts the search cycle executed by the short cycle search means based on the degree of overlap of the search results acquired by the short cycle search means in consecutive searches.

The data collection device according to any one of claims 1 to 4, further comprising, for the case where the search condition storage means stores a plurality of the search conditions:
resource allocation determination means for referring to the search accompanying information storage means and determining the distribution ratio of resources to each search condition based on the acquisition rate and the acquisition amount of the search results of each search condition stored in the search condition storage means; and
resource control means for allocating resources, based on the distribution ratio of resources to each search condition, to the processes that execute the search for each search condition in the short cycle search means and the long cycle search means.

The data collection device described, wherein the short cycle search means uses, as its search target, data accumulated in the database server after the search by the long cycle search means has started.

The data collection device according to claim 1, further comprising on-demand search means for searching the database server on the network, in response to receiving a search condition from the user, based on the received search condition,
wherein the search result storage means stores the search results acquired by the on-demand search means with duplicates eliminated.

A data collection method in a data collection device comprising search condition storage means for storing search conditions set by a user, long cycle search means, short cycle search means, search accompanying information storage means, search result storage means, short cycle adjustment means, and long cycle adjustment means, the method comprising:
a first step in which the long cycle search means performs a full search on a database server on a network based on the search conditions;
a second step in which the short cycle search means executes, on the database server on the network and based on the search conditions, a search that prioritizes the latest data at a higher frequency than the long cycle search means;
a third step in which the search accompanying information storage means stores the search results acquired in the first step and the second step in association with the search means that acquired each search result and the time at which it was acquired;
a fourth step in which the search result storage means stores the search results acquired in the first step and the second step with duplicates eliminated;
a fifth step in which the short cycle adjustment means refers to the search accompanying information storage means and determines the search cycle executed by the short cycle search means based on the average increase per unit time of data in the database server matching the search conditions, obtained as the number of search results acquired per unit time by the long cycle search means, and on the number of search results acquired per unit time by the short cycle search means; and
a sixth step in which the long cycle adjustment means refers to the search accompanying information storage means and adjusts the search cycle executed by the long cycle search means based on the degree of overlap of the search results acquired by the long cycle search means with respect to the search results acquired by the short cycle search means.

A program for causing a computer to execute a data collection method in a data collection device comprising search condition storage means for storing search conditions set by a user, long cycle search means, short cycle search means, search accompanying information storage means, search result storage means, short cycle adjustment means, and long cycle adjustment means, the program causing the computer to execute:
a first step in which the long cycle search means performs a full search on a database server on a network based on the search conditions;
a second step in which the short cycle search means executes, on the database server on the network and based on the search conditions, a search that prioritizes the latest data at a higher frequency than the long cycle search means;
a third step in which the search accompanying information storage means stores the search results acquired in the first step and the second step in association with the search means that acquired each search result and the time at which it was acquired;
a fourth step in which the search result storage means stores the search results acquired in the first step and the second step with duplicates eliminated;
a fifth step in which the short cycle adjustment means refers to the search accompanying information storage means and determines the search cycle executed by the short cycle search means based on the average increase per unit time of data in the database server matching the search conditions, obtained as the number of search results acquired per unit time by the long cycle search means, and on the number of search results acquired per unit time by the short cycle search means; and
a sixth step in which the long cycle adjustment means refers to the search accompanying information storage means and adjusts the search cycle executed by the long cycle search means based on the degree of overlap of the search results acquired by the long cycle search means with respect to the search results acquired by the short cycle search means.