JP2013073266A

JP2013073266A - Information processing device and program

Info

Publication number: JP2013073266A
Application number: JP2011209690A
Authority: JP
Inventors: Yoshitaka Hamaguchi (濱口 佳孝); Nobuyuki Nakamura (中村 信之)
Original assignee: Oki Electric Industry Co Ltd
Current assignee: Oki Electric Industry Co Ltd
Priority date: 2011-09-26
Filing date: 2011-09-26
Publication date: 2013-04-22
Anticipated expiration: 2031-09-26
Also published as: JP5782958B2

Abstract

PROBLEM TO BE SOLVED: To estimate the content of transmission data without referring to the data portions of packets when a data transmission device has divided the transmission data into packets and transmitted it. SOLUTION: An information processing device estimates the content of transmission data transmitted from a data transmission device. The information processing device comprises: first information holding means which holds first feature information including a feature amount related to the data flow when the data transmission device, which holds a plurality of pieces of transmission data, transmits each piece of transmission data; second information holding means which holds second feature information including a feature amount related to the data flow when the data transmission device transmits any piece of transmission data; and means which collates the first feature information held by the first information holding means with the second feature information held by the second information holding means, and estimates the transmission data transmitted by the data transmission device by using the result of the collation.

Description

The present invention relates to an information processing device and a program, and can be applied to, for example, a system that analyzes users' behavioral tendencies from network access activity.

Conventionally, when a user accesses content on a WWW server on the Internet from a user terminal, the history of that access has been analyzed and used for marketing, user-matched advertising, and the like.

Patent Document 1 describes a conventional technique for analyzing the access history of a user (user terminal) in this way.

To present more useful information to a user, the technique described in Patent Document 1 statistically estimates, from the user's (user terminal's) history of content access, words considered important to that user, and uses those words to provide content considered more useful to the user.

Patent Document 1: Japanese Patent Laid-Open No. H10-162011

However, the technique described in Patent Document 1 presupposes that the content viewed by the user can be uniquely identified from the URL or similar information that the user (user terminal) used to specify the content.

In an actual network environment, however, information such as a URL that uniquely identifies the content specified at the user terminal cannot always be obtained. For example, obtaining the URL of the content accessed by a user terminal by observing the packets that the terminal sends and receives on the network path between the user terminal and the WWW server requires reading the data in the payload portion of the packets, which raises the following problems.

First, when the payload portion of the packets is encrypted, the URL of the content accessed by the user terminal cannot be obtained without decrypting it.

Second, when the user terminal sends and receives a large number of packets, determining which of those packets contains the URL of the content accessed by the user terminal incurs a large cost (processing load, storage capacity, and so on).

In view of these problems, there is a need for an information processing device and program that, when a data transmission device (for example, a WEB server) transmits transmission data (for example, WEB content) divided into packets, can estimate the content of that transmission data without referring to the data portion of the packets.

An information processing device according to a first aspect of the present invention comprises: (1) first information holding means that holds first feature information including a feature amount relating to the data flow observed when each piece of transmission data is transmitted from a data transmission device holding a plurality of pieces of transmission data; (2) second information holding means that holds second feature information including a feature amount relating to the data flow observed when any piece of transmission data has been transmitted from the data transmission device; (3) collation processing means that collates the second feature information held by the second information holding means with the first feature information held by the first information holding means; and (4) estimation processing means that estimates the transmission data transmitted by the data transmission device using the collation result of the collation processing means.

An information processing program according to a second aspect of the present invention causes a computer to function as: (1) first information holding means that holds first feature information including a feature amount relating to the data flow observed when each piece of transmission data is transmitted from a data transmission device holding a plurality of pieces of transmission data; (2) second information holding means that holds second feature information including a feature amount relating to the data flow observed when any piece of transmission data has been transmitted from the data transmission device; (3) collation processing means that collates the second feature information held by the second information holding means with the first feature information held by the first information holding means; and (4) estimation processing means that estimates the transmission data transmitted by the data transmission device using the collation result of the collation processing means.

According to the present invention, when a data transmission device transmits transmission data divided into packets, the content of that transmission data can be estimated without referring to the data portion of the packets.

FIG. 1 is an explanatory diagram showing the connections between the devices related to the embodiment (including the content estimation device of the embodiment).
FIG. 2 is a block diagram showing an example of the structure of content processed by the content estimation device according to the embodiment.
FIG. 3 is an explanatory diagram showing an example of the contents of the index management table according to the embodiment.
FIG. 4 is a flowchart showing the operation of the information generation unit according to the embodiment.
FIG. 5 is a flowchart showing the operation of the estimation processing unit according to the embodiment.
FIG. 6 is an explanatory diagram showing an example of the contents of the index management table according to a modification of the embodiment.

(A) Main Embodiment
An embodiment of the information processing device and program according to the present invention is described in detail below with reference to the drawings. The information processing device of this embodiment is a content estimation device.

(A-1) Configuration of the Embodiment
FIG. 1 is an explanatory diagram showing the connections between the devices related to this embodiment (including the content estimation device 10 of the embodiment).

The content estimation device 10 shown in FIG. 1 observes (captures) the packets sent and received by the user terminal 30 and estimates the content on the WEB server 20 that the user terminal 30 accessed (for example, the URL of that content).

The user terminal 30 is assumed to be a terminal equipped with a WEB browser, such as a PC, a mobile phone, or a PDA. Any existing terminal equipped with a WEB browser can be used as the user terminal 30.

The WEB server 20 supplies the data of one of contents C1 to C4 in response to an access request from the user terminal 30. An existing WEB server can be used as the WEB server 20, so a detailed description is omitted. In the WEB server 20, U1 to U4 are defined as the URLs of contents C1 to C4, respectively, and the WEB server 20 supplies the user terminal 30 with the content data corresponding to the URL notified by the user terminal 30. The configuration of the WEB server 20 and the structure and number of the contents it stores are not limited; anything equivalent to the various existing WEB servers can be used.

The user terminal 30 accesses the WEB server 20 via the access network N2 and the Internet N1 and is supplied with (downloads) one of contents C1 to C4 on the WEB server 20. In this embodiment, the number of WEB servers 20 and user terminals 30 and the network configuration between the devices are not limited.

In the following, the IP address of the WEB server 20 is denoted S1 and the IP address of the user terminal 30 is denoted T1.

Next, an example of the structure of each content stored in the WEB server 20 is described.

FIG. 2 is a block diagram showing the structure of content C1 stored in the WEB server 20.

As noted above, the structure of the content stored in the WEB server 20 is not limited, but to simplify the explanation, the following description assumes that content C1 has the structure shown in FIG. 2.

Each content has one or more constituent elements, called element contents (hereinafter also written "EC"). As shown in FIG. 2, content C1 includes element content EC11, the body text (for example, an HTML or XML document); element content EC12, a banner advertisement (image data) called from the body text (EC11); and element content EC13, a style sheet called from the body text (EC11).

The content estimation device 10 manages an identifier for each element content constituting each content. In this embodiment, element contents EC11 to EC13 are assigned the identifiers EC11 to EC13, respectively. Specifically, the content estimation device 10 may manage these identifiers in association with the URLs used to access the element contents. Alternatively, the content estimation device 10 may manage the URL itself as the identifier, without assigning a separate identifier to each element content.

Next, an overview of the configuration of the content estimation device 10 is given.

The content estimation device 10 has an information generation unit 11, a content information storage unit 12, and an estimation processing unit 13.

The communication device 120 is constructed, for example, by installing the information processing program of the embodiment and related software on a device (computer) that has a program execution configuration such as a CPU, ROM, RAM, EEPROM, and hard disk, as well as an interface for communicating with other communication devices.

The information generation unit 11 accesses each content (C1 to C4) on the WEB server 20 and acquires information (hereinafter "flow information") based on the flow of data (packet sequence) observed when the data of the element contents constituting each content is downloaded. In this embodiment, the flow information includes statistical information for each flow (for example, the IP address of the server communicating with the user terminal 30 and the number of packets in the packet sequence that makes up the flow). Based on the acquired flow information, the information generation unit 11 generates information (hereinafter "index information") that serves as an index (heading) for searching for the content corresponding to that statistical information. In other words, the index information includes information indicating feature amounts of the data flow observed when the corresponding content is downloaded.

The information generation unit 11 also reads each content (C1 to C4) on the WEB server 20 and generates information about each content (hereinafter "content information"). The content information may include the substance of the content (keywords, words, and so on), its URL, and similar information. To simplify the explanation, this embodiment assumes that the content information includes at least the URL used to access the content.

The information generation unit 11 then records the index information and the content information, in association with each other, in the index management table 121 of the content information storage unit 12. In other words, the content information storage unit 12 is configured as a database in which content information can be searched using some or all of the items of the index information as keys. To simplify the explanation, this embodiment describes the content information storage unit 12 as managing the content information corresponding to each piece of index information in table form, but the data management scheme is not limited to a table, and various database formats can be used.
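As a minimal sketch of a store that can be searched with index information as the key, the following Python class keeps content information in a dictionary. The class name `IndexManagementTable`, its methods, and the choice of key (address, number of flows, sorted packet counts) are assumptions made for illustration and are not taken from the embodiment.

```python
from collections import defaultdict


class IndexManagementTable:
    """Hypothetical in-memory stand-in for the index management table 121."""

    def __init__(self):
        # key -> list of content information records registered under that key
        self._rows = defaultdict(list)

    @staticmethod
    def _key(address, packet_counts):
        # Assumed key: source address, number of flows, and the per-flow
        # packet counts in sorted order (so flow order does not matter).
        return (address, len(packet_counts), tuple(sorted(packet_counts)))

    def register(self, address, packet_counts, content_info):
        """Associate one piece of index information with content information."""
        self._rows[self._key(address, packet_counts)].append(content_info)

    def lookup(self, address, packet_counts):
        """Return all content information registered under an exact key match."""
        return list(self._rows.get(self._key(address, packet_counts), []))
```

For example, after `table.register("S1", [10, 4, 2], {"url": "U1"})`, a call to `table.lookup("S1", [2, 4, 10])` returns the record for U1 regardless of the order in which the flows were observed. The packet counts here are made-up values.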

The estimation processing unit 13 acquires flow information for the traffic between the user terminal 30 and the WEB server 20 and, based on that flow information, generates information that shares some or all of its items with the index information (hereinafter "search target index information") by the same processing as the index creation unit 113. In other words, like the index information, the search target index information includes information indicating feature amounts of the observed data flow.

The estimation processing unit 13 then collates the generated search target index information against each piece of index information in the content information storage unit 12 (index management table 121) and, based on the collation result, performs processing such as estimating the content accessed by the user terminal 30. In other words, the estimation processing unit 13 detects index information whose flow-related feature amounts match those of the search target index information within a predetermined range.
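The collation step can be sketched as follows. This is only an illustration under stated assumptions: the "predetermined range" is modeled as an absolute tolerance on each per-flow packet count, the feature amounts are limited to the source address and packet counts, and the names `IndexInfo`, `matches`, and `estimate_content` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexInfo:
    address: str          # data source address, e.g. "S1"
    packet_counts: tuple  # packet count of each flow
    url: str              # content information (here only the URL)


def matches(observed_counts, registered_counts, tolerance=0):
    """Return True if both have the same number of flows and, after sorting,
    each pair of packet counts differs by at most `tolerance` (assumed metric)."""
    if len(observed_counts) != len(registered_counts):
        return False
    pairs = zip(sorted(observed_counts), sorted(registered_counts))
    return all(abs(o - r) <= tolerance for o, r in pairs)


def estimate_content(address, observed_counts, table, tolerance=0):
    """Return the URLs of all registered entries that match the observation."""
    return [
        entry.url
        for entry in table
        if entry.address == address
        and matches(observed_counts, entry.packet_counts, tolerance)
    ]
```

Several URLs may be returned when more than one entry falls within the tolerance; how such ties are resolved is not specified in this sketch.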

Next, the configuration of the content estimation device 10 is described in detail.

First, the information generation unit 11 is described.

The information generation unit 11 has a flow information acquisition unit 111, a crawling processing unit 112, and an index creation unit 113.

The crawling processing unit 112 specifies the URLs of predetermined content (for example, content targeted for acquiring users' behavioral tendencies) and accesses them one after another. That is, the crawling processing unit 112 performs the same processing as when the user terminal 30 accesses each content on the WEB server 20 (processing that generates a flow (traffic) on the network).

This embodiment assumes that a list of URLs to access is registered in the crawling processing unit 112 in advance, but the way in which the crawling processing unit 112 holds the URLs to access is not limited. For example, only a single URL may be set in the crawling processing unit 112, and the crawling processing unit 112 may then successively access the content linked from the content at that URL.

When the crawling processing unit 112 accesses one content (page), it first downloads the element content for the body text of that content, analyzes the body text to find the other element contents included in the content (for example, banners and style sheets), and accesses (downloads) the element contents it finds one at a time. If the content includes multiple element contents other than the body text, such as documents, formatting, and images, the crawling processing unit 112 accesses each element of the content at sufficient intervals so that they are captured separately as multiple flows. When it accesses the content, the crawling processing unit 112 also holds the content information for that content and supplies it to the index creation unit 113. In this embodiment, the content information held by the crawling processing unit 112 includes at least the URL.
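The crawling procedure described above might look roughly like the following Python sketch, which uses the standard library `html.parser` module. The names `ElementLinkParser` and `crawl_content`, the injected `fetch` callable, and the default interval are assumptions for illustration; the sketch only looks at `img` and stylesheet `link` tags and does not resolve relative URLs.

```python
import time
from html.parser import HTMLParser


class ElementLinkParser(HTMLParser):
    """Collects URLs of element contents referenced from a body text."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and attrs.get("src"):
            self.links.append(attrs["src"])  # e.g. a banner image
        elif (tag == "link" and attrs.get("href")
              and (attrs.get("rel") or "").lower() == "stylesheet"):
            self.links.append(attrs["href"])  # a style sheet


def crawl_content(url, fetch, interval_s=5.0):
    """Download the body text, then each element content one at a time.

    `fetch` is a caller-supplied function url -> text (hypothetical).
    `interval_s` is an assumed pause so that each download can be
    observed as a separate flow.
    """
    body = fetch(url)
    parser = ElementLinkParser()
    parser.feed(body)
    fetched = [url]
    for link in parser.links:
        time.sleep(interval_s)
        fetch(link)
        fetched.append(link)
    return fetched
```

Passing `fetch` in as a parameter keeps the sketch independent of any particular HTTP client; how long an interval is "sufficient" depends on the flow timeout of whatever observes the traffic and is not stated in the text.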

In response to a content access request by the crawling processing unit 112, the flow information acquisition unit 111 observes the flow of data (packet sequence) from the crawling processing unit 112 to the WEB server 20 and acquires flow information based on the observation result. The flow information acquisition unit 111 then passes the obtained flow information to the index creation unit 113.

Specifically, the flow information acquisition unit 111 observes the flows produced when the crawling processing unit 112 downloads each element content and acquires flow information for each flow. The flow information acquisition unit 111 therefore acquires one piece of flow information per element content. The number and combination of items included in the flow information are not limited, but here the flow information is assumed to include the identifier (address) of the data source of the flow and the number of packets constituting the flow.
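To illustrate how flow information of this kind can be derived from packet headers alone, the following sketch groups observed headers into flows and counts the packets in each. Keying a flow on the (source address, source port, destination address, destination port) tuple and the function name `flow_info_from_headers` are assumptions for illustration; actual flow exporters define flows in their own ways.

```python
from collections import Counter


def flow_info_from_headers(headers):
    """Build per-flow records from packet headers only.

    `headers` is an iterable of
    (src_addr, src_port, dst_addr, dst_port) tuples; payloads are never read.
    Returns one record per flow with its source address and packet count.
    """
    counts = Counter(headers)  # identical header tuples belong to one flow
    return [{"source": key[0], "packets": n} for key, n in counts.items()]
```

Because only header fields are used, the same procedure works whether or not the payload is encrypted.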

The specific configuration by which the flow information acquisition unit 111 acquires flow information is not limited. For example, the exporter (agent) and collector processing configurations of conventional techniques such as NetFlow (see Reference 1, IETF RFC 3954) and IPFIX (see Reference 2, IETF RFC 5101) can be used, so a detailed description is omitted.

The flow information acquisition unit 111 does not need to be mounted in the content estimation device 10 itself. For example, the flow information acquisition unit 111 may be mounted in a relay device (such as a router) installed on the network path between the crawling processing unit 112 and the WEB server 20, and the content estimation device 10 may receive the statistical information from it.

The index creation unit 113 generates one or more pieces of index information based on the flow information supplied from the flow information acquisition unit 111 and registers them in the content information storage unit 12 (index management table 121) in association with the content information.

Each piece of index information generated by the index creation unit 113 includes, as key information for searching, the items of the flow information, the number of flows, and so on.

FIG. 3 shows an example of the contents of the index management table in which the index information generated by the index creation unit 113 of this embodiment and the content information have been entered.

FIG. 3 shows an extract of the index management table 121: the entries registered for content C1 based on the flow information acquired when the crawling processing unit 112 accessed it.

As shown in FIG. 3, the index management table 121 registers one or more pieces of index information generated from one piece of flow information in association with content information. Specifically, in the index management table 121 shown in FIG. 3, each row represents one set of information (hereinafter "index management information") comprising one piece of index information, an ID identifying that index information, and the content information corresponding to it. That is, for each piece of index information it generates, the index creation unit 113 of this embodiment generates index management information based on that index information and registers it in the index management table 121.

FIG. 3 shows an example in which seven pieces of index management information are registered in seven rows; the index management information with IDs R0 to R6 is labeled K10 to K16, respectively. For example, the index management information with ID R0 is K10.

Next, the index management information constituting the index management table 121 is described in detail.

As shown in FIG. 3, each piece of index information in the index management table 121 includes the items "address", "number of flows", "number of packets", "derivation type", and "original number of flows".
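One row of such a table could be represented by a small record type like the one below. This is a sketch only: the field names are illustrative, the number of flows is derived from the packet counts rather than stored, and "original number of flows" is assumed here to mean the number of flows in the entry that reflects the acquired flow information unchanged, which this passage does not define.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexManagementInfo:
    record_id: str            # ID of the index information, e.g. "R0"
    address: str              # "address" item: data source address
    packet_counts: tuple      # "number of packets" item: one count per flow
    derivation_type: tuple    # "derivation type" item
    original_flow_count: int  # "original number of flows" item (assumed meaning)
    url: str                  # content information (here only the URL)

    @property
    def flow_count(self):
        """"Number of flows" item, derived from the packet counts."""
        return len(self.packet_counts)
```

With made-up packet counts, `IndexManagementInfo("R0", "S1", (10, 4, 2), ("original data",), 3, "U1")` would describe an entry with three separate flows.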

The "address" item indicates the identifier (address) of the data source of the flows covered by the index information. Here, all of the index information relates to data flows supplied by the WEB server 20, so every address item contains "S1", the address used to access the WEB server 20.

The "number of flows" item indicates the number of flows covered by the index information.

The "number of packets" item indicates the total number of packets in each flow.

In the following, P11 denotes the number of packets when element content EC11 is downloaded as a single flow, P12 the number when element content EC12 is downloaded as a single flow, and P13 the number when element content EC13 is downloaded as a single flow.

The "derivation type" item indicates, among other things, whether the index information directly reflects the acquired flow information as is.

As described above, the crawling processing unit 112 accesses each element content of a content individually. That is, the crawling processing unit 112 controls its accesses so that each element content is observed as a separate flow. When the user terminal 30 actually accesses the content, however, it may access (download) some or all of the element contents at the same time. When the user terminal 30 accesses (downloads) multiple element contents at the same time, their flows are observed merged into a single flow. This is because, when the user terminal 30 actually accesses the content and accesses multiple element contents in succession, the flow information acquisition unit 131 may observe them as one flow.
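One way to anticipate such merged flows is to enumerate the possible groupings of element contents into flows, as in the sketch below. It assumes that the packet count of a merged flow is the sum of the packet counts of its element contents, and that every grouping is of interest; which groupings are actually registered is not stated here. The function names are hypothetical.

```python
def partitions(items):
    """Yield every way of grouping `items` into non-empty flows."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in partitions(rest):
        # `first` forms a flow of its own ...
        yield [[first]] + part
        # ... or is merged into one of the existing flows.
        for i in range(len(part)):
            yield part[:i] + [[first] + part[i]] + part[i + 1:]


def derived_indexes(packets):
    """packets: dict mapping element content ID -> packet count when that
    element content is downloaded as a single flow.
    Returns one candidate entry per grouping of element contents into flows."""
    result = []
    for part in partitions(list(packets)):
        result.append({
            "derivation": ["+".join(flow) for flow in part],
            # Assumption: a merged flow carries the sum of its parts.
            "packets": [sum(packets[e] for e in flow) for flow in part],
        })
    return result
```

For three element contents this produces five groupings, from all separate to all merged. The number of groupings grows quickly with the number of element contents, so a practical implementation would probably need to limit which ones it generates.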

Furthermore, an element content that is shared by multiple contents (for example, a style sheet) may be cached on the user terminal 30. If an element content is cached on the user terminal 30, the cached element content is not downloaded even when the user terminal 30 accesses a content that includes it. In other words, the observed flow information may differ depending on whether element contents are cached on the user terminal 30.

For this reason, the index creation unit 113 of this embodiment also generates, for a content whose element contents include a style sheet, index information for the case in which the download of that element content is omitted. Whether an element content is a style sheet can be determined easily in the index creation unit 113, for example by checking its file extension or which part of the original content calls it (for example, confirming that it was linked from a style sheet declaration).
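A sketch of how the cache-dependent variants might be generated is shown below. It identifies style sheets by the `.css` extension only, which is one of the checks mentioned above, and the function names `is_stylesheet` and `cache_variants` are hypothetical.

```python
from itertools import combinations


def is_stylesheet(url):
    """Assumed check: treat URLs whose path ends in .css as style sheets."""
    return url.lower().split("?", 1)[0].endswith(".css")


def cache_variants(elements):
    """elements: dict mapping element content URL -> packet count.

    Returns one dict per cache state: the full set first, then each variant
    in which some of the cacheable element contents are not downloaded.
    """
    cacheable = [u for u in elements if is_stylesheet(u)]
    variants = []
    for r in range(len(cacheable) + 1):
        for omitted in combinations(cacheable, r):
            variants.append(
                {u: n for u, n in elements.items() if u not in omitted}
            )
    return variants
```

Each variant can then be turned into index information in the same way as the full set of element contents.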

Accordingly, the information generation unit 11 of this embodiment manages two kinds of index information: the index information for the case where every element content is downloaded in its own separate flow, which serves as the reference (hereinafter “reference index information”), and index information derived from the reference index information (hereinafter “derived index information”). To manage both, the index management table 121 includes the “derived type” item described above.

In FIG. 3, “original data” is entered in the derived type item for reference index information, while for derived index information the derived type item holds information indicating the combination of element contents that makes up each flow.

For example, in FIG. 3, the index information in the index management information K10, whose ID is “R0”, is managed as reference index information. In the index information (reference index information) of the index management information K10 shown in FIG. 3, the number-of-packets item holds the packet counts “P11”, “P12”, and “P13” of the three flows corresponding to the element contents EC11 to EC13.

In contrast, in FIG. 3, the index information in the index management information K11, whose ID is “R1”, is managed as derived index information. Specifically, the index information of the index management information K11 is registered as the derived index information for the case where the element contents EC11 and EC12 are downloaded in one flow and the element content EC13 is downloaded in a separate flow. In the index information (derived index information) of the index management information K11 shown in FIG. 3, the number-of-packets item holds the packet counts of the two flows, “P11+P12” and “P13”. Here, “P11+P12” is the number of packets in the flow in which the element contents EC11 and EC12 are downloaded together, and “P13” is the number of packets in the flow in which the element content EC13 is downloaded alone. The derived type item holds two entries, “EC11+EC12” and “EC13”, as information on the combination of element contents making up each of these two flows. “EC11+EC12” indicates that the flow consists of the data of the two element contents EC11 and EC12, and “EC13” indicates that the flow consists of the data of the single element content EC13.

The correspondence between the index management information for reference index information and the index management information for derived index information may be managed by providing a separate item and linking the entries to each other with pointers or the like. In FIG. 3, these links are shown as arrows between the pieces of index management information: links extend from the index management information K10 at the center to the index management information K11 to K16. This makes it easier, for example, to keep the index management table 121 (index management information) consistent when it is updated.

The “original number of flows” item indicates the number of flows in the reference index information from which the index information is derived. In other words, the original number of flows is exactly the number of flows observed by the flow information acquisition unit 111 for the corresponding content. For reference index information, therefore, the “number of flows” item and the “original number of flows” item have the same value.

Next, the estimation processing unit 13 will be described in detail.

The estimation processing unit 13 includes a flow information acquisition unit 131, a content information search unit 132, a reliability calculation unit 133, and an output unit 134.

The flow information acquisition unit 131 observes the flow of data (packet sequences) between the WEB server 20 and the user terminal 30, acquires flow information, and supplies it to the content information search unit 132. When the user terminal 30 requests access to any content on the WEB server 20, the WEB server 20 sends the data of that content to the user terminal 30, and the flow information acquisition unit 131 observes this data flow to acquire the flow information. The flow information acquisition unit 131 preferably acquires flow information by the same processing as the flow information acquisition unit 111 described above. The flow information acquisition unit 131 may also restrict the protocols for which it acquires flow information to a predetermined protocol. For example, if content data is supplied from the WEB server 20 to the user terminal 30 only over HTTP, the flow information acquisition unit 131 may acquire flow information for HTTP only. The flow information acquisition unit 131 may also narrow down the packets to be observed by address (for example, the source address and/or destination address of a packet).

Based on the flow information obtained by the flow information acquisition unit 131, the content information search unit 132 generates “search target index information”, which has the same items as the index information except for the “derived type” and “original number of flows” items. That is, the search target index information of this embodiment includes the items “address”, “flow information”, and “number of packets”.

The content information search unit 132 then collates the generated search target index information with each piece of index information in the index management table 121 of the content information storage unit 12, and detects the content information corresponding to any index information that is judged to match within a predetermined range.

At this time, the content information search unit 132 detects only those pieces of index information in the index management table 121 whose address, number of flows, and per-flow packet counts are judged to match those of the search target index information. The per-flow packet counts need not match exactly: to allow for variation in dynamic content, counts that differ by no more than a predetermined tolerance are treated as matching. In this embodiment, as an example, the content information search unit 132 may treat a packet count in the search target index information as matching if it is within ±3% of the packet count in the index management table 121. The tolerance that the content information search unit 132 allows for packet counts is not limited to this example.
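
The matching rule above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the dictionary layout, the function names `within_tolerance` and `matches`, and the assumption that the flows of both sides are already listed in corresponding order are all hypothetical. Only the ±3% default comes from the text.

```python
def within_tolerance(observed: int, indexed: int, tolerance: float = 0.03) -> bool:
    """True if the observed packet count is within +/-tolerance of the indexed count."""
    return abs(observed - indexed) <= tolerance * indexed


def matches(query: dict, entry: dict, tolerance: float = 0.03) -> bool:
    """Compare search target index information with one index table entry.

    Both arguments are hypothetical dicts of the form
    {"address": str, "packets": [int, ...]}, one packet count per flow.
    Flows are assumed to be listed in corresponding order.
    """
    # The address must be identical.
    if query["address"] != entry["address"]:
        return False
    # The number of flows must be identical.
    if len(query["packets"]) != len(entry["packets"]):
        return False
    # Every per-flow packet count must fall within the tolerance.
    return all(
        within_tolerance(observed, indexed, tolerance)
        for observed, indexed in zip(query["packets"], entry["packets"])
    )


entry = {"address": "S1", "packets": [100, 200, 50]}
print(matches({"address": "S1", "packets": [102, 197, 50]}, entry))
print(matches({"address": "S1", "packets": [110, 200, 50]}, entry))
```

The first query differs from the entry by at most 2% per flow and is accepted; the second has one flow that is 10% off and is rejected.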

When the content information search unit 132 detects, in the index management table 121, index information that is judged to match the search target index information as described above, it supplies that index information and the corresponding content information to the reliability calculation unit 133.

The reliability calculation unit 133 calculates a value indicating how reliable it is that the content corresponding to the search target index information is the content indicated by the content information for the detected index information. Because the content information search unit 132 may detect more than one piece of content information (index information), the reliability calculation unit 133 calculates a reliability that reflects the detection result. Multiple detections can occur because, for example, contents on the same WEB server share the same IP address, and documents that happen to have the same size and the same number of flows per content will have roughly the same packet counts.

In this embodiment, as an example, the reliability calculation unit 133 uses as the reliability a value weighted according to the number of detected pieces of content information (index information) and the difference in the number of flows for the detected index information (the difference between the “original number of flows” and the “number of flows” of that content information).

When the content information search unit 132 detects many matching pieces of content information (index information), the result contains more content other than what the user terminal 30 actually accessed than when only one match is detected, so the reliability is presumed to be low. In addition, the closer the number of flows observed by the flow information acquisition unit 131 is to the number of flows observed by the flow information acquisition unit 111, the more likely it is that the user terminal 30 accessed the same content (the same URL). The reliability calculation unit 133 therefore calculates a value that takes these factors into account as the reliability, which can then be used in subsequent processing of the detected content information (index information), such as data mining.

In the following, let df be the number of pieces of content information (index information) detected by the content information search unit 132, f0 the “original number of flows” of the content information (index information) whose reliability is being calculated, and f1 its “number of flows” (the number of flows in the search target index information). The reliability A can then be expressed, for example, by equation (1) below. A larger value of A indicates a higher reliability for that content information (index information).

Reliability A = f1 / (f0 × √df) … (1)

In equation (1) above, the reliability A tends to become smaller as df becomes larger.

In equation (1), the reliability A also tends to become smaller as f1 becomes smaller. Because f0 is the number of flows when every element content of the content is downloaded in a separate flow, f0 ≥ f1 basically holds. When f0 > f1, the flows of some element contents have merged into a single flow, and the more flows have merged, the smaller f1 becomes. Equation (1) is therefore designed so that A decreases as f1 decreases, reflecting the fact that the reliability of the index information falls as more flows are merged.

Equation (1) is only one example of how the reliability value may be calculated. The specific formula is not limited, and any other formula may be used as long as it exhibits the tendencies described above.
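
As a small illustration, equation (1) can be written directly in Python. The function name `reliability` and the input validation are assumptions added for the sketch; the formula itself is the one given in equation (1), and the sample values are illustrative.

```python
import math


def reliability(f1: int, f0: int, df: int) -> float:
    """Equation (1): A = f1 / (f0 * sqrt(df)).

    f1: number of flows in the search target index information
    f0: original number of flows (all element contents in separate flows)
    df: number of index entries detected as matching
    """
    if f0 <= 0 or df <= 0:
        raise ValueError("f0 and df must be positive")
    return f1 / (f0 * math.sqrt(df))


# No flows merged (f1 == f0) and a single match gives the maximum value 1.0.
print(reliability(3, 3, 1))
# Two matches lower the value; merged flows (f1 < f0) lower it further.
print(round(reliability(3, 3, 2), 2))
print(round(reliability(3, 4, 2), 2))
```

Both tendencies described for equation (1) are visible here: the value drops as df grows and as f1 falls below f0.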

The output unit 134 outputs the one or more pieces of content information (URL information) obtained, together with their reliability values. The output format is not limited; the output unit 134 may simply output a list of the content information (URL information) and reliability values, for example as a spreadsheet or text data. The output destination is not limited either: the data may be written to a recording medium such as a hard disk or DVD-R, or sent to another device by communication. Specifically, the output unit 134 may output the data to an information processing device (for example, a server device that performs data mining) that estimates user preferences from information on frequently accessed URLs.

(A-2) Operation of the embodiment
Next, the operation of the content estimation apparatus 10 configured as described above will be described.

First, the operation of the information generation unit 11 will be described with reference to the flowchart of FIG. 4. Specifically, the following describes how the information generation unit 11 accesses the content C1 (FIG. 2 above) on the WEB server 20, creates index information from the captured flow information, and registers it in the content information storage unit 12 (index management table 121).

As described above, the URL of the content C1 shown in FIG. 2 is U1, and the content is assumed to be stored on the WEB server 20 at address S1.

First, the crawling processing unit 112 sequentially accesses the element contents EC11 to EC13 constituting the content C1. During these accesses, the flow information acquisition unit 111 acquires flow information from the data (traffic) sent from the WEB server 20 to the crawling processing unit 112, and supplies the flow information for each of the element contents EC11 to EC13 to the index creation unit 113. The crawling processing unit 112 also supplies information including the URL, as the content information of the content C1, to the index creation unit 113 (S101).

Specifically, the crawling processing unit 112 first accesses the WEB server 20 by specifying the configured URL (U1) and reads the element content EC11, which is the body of the content. The data (traffic) reaches the crawling processing unit 112 by way of the flow information acquisition unit 111. At this time, the flow information acquisition unit 111 acquires flow information that includes the address S1 and the packet count P11.

Then, based on the description in the acquired body (element content EC11), the crawling processing unit 112 specifies the URLs of the other element contents EC12 and EC13 constituting the content C1 and accesses them at intervals long enough that their flows do not merge into one. For example, the crawling processing unit 112 may send an access request for one element content to the WEB server 20 and, once the data (packets) for that request has started to arrive, treat the download of that element content as finished when no further data (packets) arrives for a predetermined time or longer, and then start downloading the next element content.
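
The idle-timeout idea above can be illustrated with a short Python sketch. It is an assumption-laden simplification: the function name `split_by_idle_gap`, the use of plain arrival timestamps in seconds, and the rule that a gap of at least `idle_timeout` ends one download are all hypothetical, and real flow identification would also consider addresses and ports.

```python
def split_by_idle_gap(timestamps: list[float], idle_timeout: float) -> list[int]:
    """Group packet arrival times into bursts and return the packet count of each.

    A new burst starts whenever the gap since the previous packet is at
    least idle_timeout seconds, mirroring the idea that a download is
    considered finished once no packet has arrived for that long.
    """
    counts: list[int] = []
    previous = None
    for t in sorted(timestamps):
        if previous is None or t - previous >= idle_timeout:
            counts.append(0)  # start a new burst
        counts[-1] += 1
        previous = t
    return counts


# Three packets, a long pause, then two more packets: two separate bursts.
print(split_by_idle_gap([0.0, 0.1, 0.2, 5.0, 5.1], idle_timeout=1.0))
```

With accesses spaced this way, each element content yields its own packet count, which is what the reference index information needs.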

By repeating this processing, the flow information acquisition unit 111 is assumed to obtain the packet counts P11, P12, and P13 for the element contents EC11, EC12, and EC13, respectively. The flow information acquisition unit 111 then supplies the flow information for each element content to the index creation unit 113. For example, the flow information for the element content EC11 includes the address S1 and the packet count P11.

Next, the index creation unit 113 generates index information from the flow information of the element contents EC11 to EC13 supplied by the flow information acquisition unit 111 (S102), and updates the index management table 121 by additionally registering index management information (S103).

The index management information that the index creation unit 113 additionally registers in step S103 consists of the seven pieces of index management information K10 to K16 shown in FIG. 3.

Specifically, the index creation unit 113 first generates the reference index information for the content C1 from the flow information of the element contents EC11, EC12, and EC13. In FIG. 3, the index information of the index management information K10 is the reference index information. The index creation unit 113 creates the key constituting the reference index information from the flow information (address S1 and packet counts P11, P12, P13) obtained by the flow information acquisition unit 111 for the element contents EC11, EC12, and EC13. The index creation unit 113 then acquires the content information (URL: U1) of the content C1 and registers the index management information K10 shown in FIG. 3.

Next, the index creation unit 113 creates derived index information for the cases where two of the three flows in the reference index information overlap and are observed merged as a single flow, and registers index management information based on it (the index management information K11 to K13 shown in FIG. 3) in the index management table 121.

The index creation unit 113 also creates derived index information for the case where all three flows in the reference index information overlap and are observed merged as a single flow, and registers index management information based on it (the index management information K14 shown in FIG. 3) in the index management table 121.

In this way, when the reference index information contains multiple flows, the index creation unit 113 determines every combination in which those flows can merge, creates derived index information for each combination, and registers index management information based on it in the index management table 121.

The index creation unit 113 also recognizes the element content EC13 as a style sheet, as described above. It therefore creates derived index information for the case where no flow occurs for the element content EC13, and registers the index management information K15 and K16 shown in FIG. 3 based on it. The index management information K16 is based on derived index information for the case where the flow of the element content EC11 and the flow of the element content EC12 are merged.

In this way, the index creation unit 113 creates derived index information in which the flows of element contents whose download may be omitted, as described above, are left out of the flows in the reference index information. When more than one element content may be omitted, the index creation unit 113 creates derived index information for every combination of omissions. Furthermore, for derived index information in which some flows have been omitted, the index creation unit 113 also creates derived index information for the cases where some or all of the remaining flows are merged.
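
The enumeration described above (every way of merging flows, combined with every way of omitting cacheable element contents) can be sketched in Python. This is a hypothetical illustration: the names `partitions` and `derived_indexes`, the dictionary input, and the sample packet counts are assumptions. The packet count of a merged flow is taken to be the sum of its parts, following the “P11+P12” notation in the text.

```python
from itertools import combinations


def partitions(items):
    """Yield every way to split a list into non-empty groups (set partitions)."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in partitions(rest):
        # Put the first item in a group of its own ...
        yield [[first]] + part
        # ... or merge it into each existing group in turn.
        for i in range(len(part)):
            yield part[:i] + [[first] + part[i]] + part[i + 1:]


def derived_indexes(packets, cacheable):
    """List flow layouts as [(label, packet_count), ...] per layout.

    packets:   dict mapping element content name to its packet count
    cacheable: names whose download may be omitted (e.g. style sheets)
    The first layout returned (nothing omitted, nothing merged) is the
    reference index itself; the rest are derived.
    """
    names = list(packets)
    layouts = []
    for k in range(len(cacheable) + 1):
        for omitted in combinations(cacheable, k):
            kept = [n for n in names if n not in omitted]
            if not kept:
                continue
            for part in partitions(kept):
                layouts.append(
                    [("+".join(group), sum(packets[n] for n in group)) for group in part]
                )
    return layouts


layouts = derived_indexes({"EC11": 10, "EC12": 20, "EC13": 5}, cacheable=["EC13"])
for layout in layouts:
    print(layout)
```

For three element contents of which one may be cached, the sketch produces seven layouts (one reference, three pairwise merges, one full merge, and two with EC13 omitted), which agrees with the seven entries K10 to K16 mentioned for FIG. 3. The number of layouts grows quickly with the number of element contents, so a practical system would likely need to bound this enumeration.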

Next, the operation of the estimation processing unit 13 will be described with reference to the flowchart of FIG. 5. Specifically, the following describes how, when the user terminal 30 accesses the content C1 (FIG. 2 above) on the WEB server 20, the estimation processing unit 13 creates index information from the flow information obtained by observing that traffic and identifies the content accessed by the user terminal 30.

First, the user terminal 30 accesses the content C1 on the WEB server 20, and the data (packets) from the WEB server 20 reaches the user terminal 30 by way of the flow information acquisition unit 131. At this time, the user terminal 30 acquires the element contents EC11 to EC13 constituting the content C1 (S201). As a result, the flow information acquisition unit 131 acquires the flow information for the element contents EC11 to EC13 sent to the user terminal 30.

The flow information acquired by the flow information acquisition unit 131 is expected, in many cases, to be the same as that obtained by the flow information acquisition unit 111 during crawling by the crawling processing unit 112 (that is, to have the same content as the reference index information). Here, it is assumed that the flow information acquisition unit 131 acquires flow information for three flows: a flow with address “S1” and packet count “Px1”, a flow with address “S1” and packet count “Px2”, and a flow with address “S1” and packet count “Px3”. The flow information for each flow (including the address and the packet count) is then supplied from the flow information acquisition unit 131 to the content information search unit 132.

Next, the content information search unit 132 generates search target index information from the per-flow flow information supplied by the flow information acquisition unit 131 (S202).

Specifically, the content information search unit 132 generates search target index information in which the address is “S1”, the number of flows is “3”, and the per-flow packet counts are “Px1”, “Px2”, and “Px3”.

Next, the content information search unit 132 collates the generated search target index information with each piece of index information in the index management table 121 and detects the index information (content information) that is judged to match (S203).

Specifically, the content information search unit 132 first detects the index information whose “address” and “number of flows” match those of the search target index information. It then collates the per-flow packet counts of the detected index information with the packet counts of the search target index information. When there are multiple flows, the content information search unit 132 must decide which packet count in the search target index information is to be compared with which packet count in the detected index information. The method of deciding this pairing is not limited; for example, the packet counts with the closest values may be paired, or the pairing that minimizes the total difference may be used.
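
The two pairing strategies mentioned above can be sketched in Python as follows. Both functions are hypothetical illustrations that assume the two lists have the same length (the number of flows has already been matched). Pairing by sorted order is used here as one simple reading of “closest values”; the brute-force search over permutations finds the pairing with the smallest total absolute difference and is practical only for small flow counts.

```python
from itertools import permutations


def pair_by_sorting(observed: list[int], indexed: list[int]) -> list[tuple[int, int]]:
    """Pair the smallest observed count with the smallest indexed count, and so on."""
    return list(zip(sorted(observed), sorted(indexed)))


def pair_by_min_total_diff(observed: list[int], indexed: list[int]) -> list[tuple[int, int]]:
    """Try every ordering of the indexed counts and keep the one whose total
    absolute difference from the observed counts is smallest."""
    best = min(
        permutations(indexed),
        key=lambda perm: sum(abs(o - p) for o, p in zip(observed, perm)),
    )
    return list(zip(observed, best))


observed = [50, 10, 30]
indexed = [11, 29, 52]
print(pair_by_sorting(observed, indexed))
print(pair_by_min_total_diff(observed, indexed))
```

Each resulting pair can then be checked against the tolerance for packet counts.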

Here, as an example, assume that the content information search unit 132 has detected the index information of the index management information K10 shown in FIG. 3 as having the same address and number of flows as the search target index information. Assume further that the content information search unit 132 has decided to compare packet counts for the three pairs "Px1 and P11", "Px2 and P12", and "Px3 and P13". The content information search unit 132 compares the packet counts of each pair and determines whether the difference is within a predetermined range. For example, the content information search unit 132 may regard a pair as matching if the packet count of the search target index information is within ±3% of the packet count of the index information (the allowable range may also be a constant). If the packet counts are regarded as matching for all pairs (flows), the content information search unit 132 detects the index information as matching the search target index information. Note that, even when the packet counts are not regarded as matching for all pairs (flows), the content information search unit 132 may still detect the index information as matching the search target index information if the number of non-matching pairs is at or below a predetermined number.
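The per-pair tolerance check and the optional allowance for a few non-matching pairs can be sketched as below. The function names, the `max_mismatches` parameter, and the sample counts are assumptions made for illustration; the ±3% default follows the example in the text.

```python
def counts_match(query_count, index_count, tolerance=0.03):
    """Return True if query_count lies within +/- tolerance (as a fraction)
    of index_count, the packet count registered in the index information."""
    return abs(query_count - index_count) <= tolerance * index_count


def index_matches(pairs, tolerance=0.03, max_mismatches=0):
    """Return True if at most max_mismatches pairs fail the tolerance check.

    pairs is a list of (query_count, index_count) tuples, one per flow.
    max_mismatches=0 corresponds to requiring every flow to match.
    """
    mismatches = sum(
        1 for query, index in pairs if not counts_match(query, index, tolerance)
    )
    return mismatches <= max_mismatches


# Hypothetical packet counts for the three pairs (Px1, P11), (Px2, P12), (Px3, P13).
pairs = [(100, 100), (102, 100), (50, 52)]
print(index_matches(pairs))                    # strict: every pair must match
print(index_matches(pairs, max_mismatches=1))  # relaxed: one mismatch allowed
```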

Here, it is assumed that the index creation unit 113 has detected, as index information matching the search target index information, the index information of the index management information K10 shown in FIG. 3 and the index information (not shown) of the index management information for content C2 (hereinafter "K20"). The index information of the index management information K20 is assumed to have the address "S1", the number of flows "3", and the original number of flows "3".

The content information search unit 132 then supplies the detected index information, together with the content information (URL) corresponding to that index information, to the reliability calculation unit 133.

The reliability calculation unit 133 then calculates a reliability for each piece of index information supplied from the content information search unit 132 (S204).

Here, the reliability calculation unit 133 calculates the reliability of the index information of the index management information K10 and of the index information of the index management information K20, in each case using equation (1) above.

For the index management information K10, the original number of flows f0 = 3, the number of flows f1 = 3, and the total number of retrieved contents df = 2, so the reliability A is 3 / (3 × √2) ≈ 0.71. For the index management information K20, f0 = 4, f1 = 3, and df = 2, so the reliability A is 3 / (4 × √2) ≈ 0.53.
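The two reliability values can be reproduced with a short sketch. Equation (1) itself is not reproduced in this passage, so the form A = f1 / (f0 × √df) used here is an assumption inferred from the worked values above; the function and parameter names are likewise illustrative.

```python
import math


def reliability(original_flows, flows, retrieved_contents):
    """Reliability A = f1 / (f0 * sqrt(df)).

    original_flows     -- f0, number of flows in the original (reference) index
    flows              -- f1, number of flows in the matched index information
    retrieved_contents -- df, total number of contents retrieved by the search
    The formula is inferred from the worked example, not quoted from equation (1).
    """
    return flows / (original_flows * math.sqrt(retrieved_contents))


print(round(reliability(3, 3, 2), 2))  # K10 -> 0.71
print(round(reliability(4, 3, 2), 2))  # K20 -> 0.53
```

Under this form, the reliability falls both when the matched index has fewer flows than its original (f1 < f0) and when more candidate contents are retrieved (larger df).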

The reliability calculation unit 133 then supplies the content information (URL) corresponding to the index information supplied from the content information search unit 132, together with the corresponding reliability information, to the output unit 134.

The output unit 134 then outputs the content information and reliability information supplied from the reliability calculation unit 133 in a predetermined manner (S205).

When a plurality of pieces of content information are supplied from the reliability calculation unit 133, the output unit 134 may output all of them or only some of them (for example, the content information with the highest reliability). The output unit 134 may also omit the reliability information and output only the content information. In addition, the output unit 134 may output other information in association with the content information. For example, the output unit 134 may hold the current date and time, an identifier of the user terminal 30 (for example, an IP address or host name), and the like, and output them together with the content information.

(A-3) Effects of the Embodiment
According to this embodiment, the following effects can be achieved.

The estimation processing unit 13 of the content estimation device 10 can estimate the content of traffic from the flow information of that traffic alone, without referring to the data (payload) of the packets sent from the WEB server 20. As a result, the content estimation device 10 can identify the content accessed by the user (user terminal 30) even when the information specifying the URL in a packet is encrypted and cannot be read, when the traffic (number of packets) passing the observation point is too large for the contents of every packet to be analyzed to obtain the URL, or when the traffic can be observed only by a device that cannot inspect packet contents and can obtain only flow information.

Furthermore, by using the content estimation device 10, a device that analyzes accessed URL information, for example for user preference analysis (such as a server device that performs data mining), can hold the information needed for the analysis (for example, content information including the URLs accessed by the user terminal 30) even when the URL cannot be obtained directly from the packets or when obtaining it is inefficient because of processing load or similar problems.

(B) Other Embodiments
The present invention is not limited to the above embodiment, and modified embodiments such as those exemplified below are also possible.

(B-1) The above embodiment describes only the processing in which the information generation unit 11 adds index management information to the index management table 121. However, when the crawling processing unit 112 accesses the same content more than once, the contents of the index management table 121 may be updated based on the latest flow information acquired by the flow information acquisition unit 111 at that time. For example, when the content includes dynamically changing element content such as an advertising banner, the flow information (in particular, the number of packets) may change.

When the URL of the same content is already registered in the content information storage unit 12, the index creation unit 113 compares the packet count of each flow (each element content) between the newly obtained flow information and the reference index information registered in the index management table 121. If the comparison shows a difference in data size, such as the packet count, the index creation unit 113 updates the packet count in that reference index information, and in the derived index information derived from it, to information indicating a range that includes both the newly obtained packet count and the previously obtained packet counts.

For example, suppose the index management information K10 to K16 for content C1 is already registered in the index management table 121 as shown in FIG. 3, and the crawling processing unit 112 accesses content C1 again so that flow information is acquired. In that case, the index creation unit 113 may update the contents of the index management table 121 based on the latest flow information. Here, assume that the element content EC12 is a banner advertisement image and that the number of packets for the element content EC12 has become larger than the previous P12 (hereinafter "P12'").

In this case, for the index management information whose URL is U1 (the index management information K corresponding to content C1), the index creation unit 113 rewrites each P12 entry in the packet count field of the index information as "P12 to P12'" so that it can take any value in the range from P12 to P12' (or from P12' to P12 if P12 > P12').

As a result, the index management information K10 to K16 has the contents shown in FIG. 6.
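The range update described above can be sketched as follows, assuming that each packet count entry is stored as a `(low, high)` tuple and that a single observed value P12 is stored as `(P12, P12)`. The representation, the function name `widen_range`, and the sample values are illustrative assumptions.

```python
def widen_range(current, observed):
    """Return a (low, high) range that covers both the stored range and the
    newly observed packet count, as in rewriting P12 to "P12 to P12'"."""
    low, high = current
    return (min(low, observed), max(high, observed))


entry = (40, 40)                 # hypothetical P12 = 40, stored as a degenerate range
entry = widen_range(entry, 55)   # a re-crawl observes P12' = 55
print(entry)                     # (40, 55)
```

Using `min` and `max` covers both orderings, so the same call handles the case where the new count is smaller than the stored one.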

When a range expression such as "P12 to P12'" above has been added to the packet count field of the index management information, the comparison of packet counts with the search target index information in the content information search unit 132 must also take that range into account.

Here, as an example, the case is described in which the content information search unit 132 compares the packet count P2x of the search target index information with the packet count "P12 to P12'" of the index information constituting the index management information K10 shown in FIG. 4.

For example, the content information search unit 132 may determine that "P2x" and "P12 to P12'" are matching packet counts when "P2x" is a value within the range "P12 to P12'". Even when "P2x" falls outside "P12 to P12'", the content information search unit 132 may still determine that "P2x" and "P12 to P12'" match if the error is within a predetermined range (for example, within ±3%). Specifically, for example, "P2x" and "P12 to P12'" may be determined to match when "P2x" is within the range "(P12 × 0.97) to (P12' × 1.03)".
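The range comparison with a tolerance margin can be sketched as below. The function name `in_range` and the sample values are assumptions; the 3% default and the bounds (low × 0.97) to (high × 1.03) follow the example in the text, with `low` and `high` standing for P12 and P12'.

```python
def in_range(count, low, high, tolerance=0.03):
    """Return True if count lies within [low * (1 - tolerance),
    high * (1 + tolerance)], i.e. inside the stored range widened by the
    tolerance on both sides. tolerance=0 gives the strict range check."""
    return low * (1 - tolerance) <= count <= high * (1 + tolerance)


# Hypothetical range P12 = 40, P12' = 55.
print(in_range(39, 40, 55))               # slightly below P12, inside the margin
print(in_range(38, 40, 55))               # below the widened lower bound
print(in_range(39, 40, 55, tolerance=0))  # strict range check
```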

(B-2) In the above embodiment, URL information is registered as the content information of the index management table 121, but part of the content at that URL, or information indicating features of that content, may be registered as well. For example, when the estimation results of the content estimation device 10 are used in a system that analyzes preference information using words in the content (for example, a data mining system), words appearing in the content (for example, words detected a predetermined number of times or more, or keywords specified for SEO (Search Engine Optimization) or the like) may be registered as content information. In a system that uses the estimation results of the content estimation device 10, this allows processing such as separately preparing a DB that associates content URLs with the words in the content and acquiring words from it to be consolidated, making the system as a whole more efficient.

(B-3) In the above embodiment, the flow information acquisition unit 111 of the information generation unit 11 and the flow information acquisition unit 131 of the estimation processing unit 13 are described as separate components, but depending on the network configuration (for example, when the flow observation points are at the same position), they may be constructed as a single component.

(B-4) The index creation unit 113 of the above embodiment creates not only the reference index information but also derived index information derived from it; however, generation of some or all of the derived index information may be omitted. In the above embodiment, the index creation unit 113 creates derived index information that accounts for the case where flows indicated by the reference index information are merged and derived index information that accounts for the case where some of the flows indicated by the reference index information are omitted, but derived index information may be created for only some of these conditions. For example, the index creation unit 113 may create only the derived index information for the case where flows indicated by the reference index information are merged, and not generate derived index information for the case where some of the flows are omitted.

(B-5) In the above embodiment, the number of packets is observed as the value indicating the data amount of each flow, but the cumulative data amount of the packets (or the cumulative data amount of the payloads alone) may be observed instead.

(B-6) In the above embodiment, the estimation processing unit 13 includes the reliability calculation unit 133, but a configuration in which the reliability calculation unit 133 is omitted (or in which its operation can be stopped by a setting) may be adopted.

(B-7) In the above embodiment, the content estimation device 10 includes the information generation unit 11, but the information generation unit 11 may be omitted, with the data of the index management table 121 obtained from outside, held, and used by the estimation processing unit 13. The content estimation device 10 may also omit the content information storage unit 12 and instead read and hold the index management table 121 from storage means constructed as an external storage device. Furthermore, the information generation unit 11 alone or the estimation processing unit 13 alone may be constructed as a standalone information processing apparatus.

DESCRIPTION OF SYMBOLS: 10 ... content estimation device, 11 ... information generation unit, 111 ... flow information acquisition unit, 112 ... crawling processing unit, 113 ... index creation unit, 12 ... content information storage unit, 121 ... index management table, 13 ... estimation processing unit, 131 ... flow information acquisition unit, 132 ... content information search unit, 133 ... reliability calculation unit, 134 ... output unit, 20 ... WEB server, C1 to C4 ... content, EC11 to EC13 ... element content, 30 ... user terminal, N1 ... Internet, N2 ... access network.

Claims

An information processing apparatus comprising:
first information holding means for holding first feature information including a feature amount relating to the flow of data when each piece of transmission data is transmitted from a data transmission device holding a plurality of pieces of transmission data;
second information holding means for holding second feature information including a feature amount relating to the flow of data when any piece of transmission data is transmitted from the data transmission device;
collation processing means for collating the second feature information held by the second information holding means against the first feature information held by the first information holding means; and
estimation processing means for estimating the transmission data transmitted by the data transmission device using a collation result of the collation processing means.

The information processing apparatus according to claim 1, further comprising:
transmission data requesting means for requesting the data transmission device to transmit each piece of transmission data; and
feature information generation means for observing the flow of data when the transmission data requested by the transmission data requesting means is transmitted from the data transmission device, and generating the first feature information relating to that transmission data based on the observation result,
wherein the first information holding means holds the first feature information generated by the feature information generation means for each piece of transmission data.

The information processing apparatus according to claim 2, wherein,
when the transmission data is composed of a plurality of pieces of element data, the transmission data requesting means requests the data transmission device to transmit the pieces of element data at different timings, and
the feature information generation means acquires, for each piece of element data, a feature amount relating to the flow of data when that element data is transmitted from the data transmission device, and includes information on the acquired feature amount of each piece of element data in the first feature information relating to the transmission data.

An information processing program causing a computer to function as:
first information holding means for holding first feature information including a feature amount relating to the flow of data when each piece of transmission data is transmitted from a data transmission device holding a plurality of pieces of transmission data;
second information holding means for holding second feature information including a feature amount relating to the flow of data when any piece of transmission data is transmitted from the data transmission device;
collation processing means for collating the second feature information held by the second information holding means against the first feature information held by the first information holding means; and
estimation processing means for estimating the transmission data transmitted by the data transmission device using a collation result of the collation processing means.