JP2010511246A

JP2010511246A - Search log misuse prevention method and apparatus

Info

Publication number: JP2010511246A
Application number: JP2009539187A
Authority: JP
Inventors: キム，ヨン−ダイ; オー，ジャン・ミン; チョイ，ジェ・ゴル; キム，ドン・ウク; リー，ユン・シク
Original assignee: Naver Corp
Current assignee: Naver Corp
Priority date: 2006-11-29
Filing date: 2007-11-29
Publication date: 2010-04-08
Anticipated expiration: 2027-11-29
Also published as: JP5118707B2; KR20080048827A; WO2008066341A1; KR100837334B1

Abstract

The present invention discloses a method for preventing abuse of search logs. A search log abuse prevention method according to one embodiment of the present invention includes selecting, from a search log, targets to be inspected for abnormal behavior, and detecting abnormal behavior by scoring the degree to which each selected target deviates from normal. The method may further include correcting the search log by removing the detected abnormal behavior from it using a predetermined deduction logic. Efficiently removing the abnormal behavior detected in the search log prevents abuse of the log and keeps the log clean.

Description

The present invention relates to Internet search, and more particularly to a method and apparatus for efficiently preventing the abuse of search logs.

In recent years, the development of the Internet has brought users a wide variety of Internet-based services, the most representative of which is the search service. In a search service, when a user enters a search term in the search box of a search site operated by a search service provider, the provider returns information corresponding to that search term as the search result.

The search terms that users enter in order to use the search service, together with information about each user's search behavior, are stored in the form of a search log. By analyzing such search logs, the search service provider can offer users a variety of search services.

For example, in keyword advertising, billing is determined by keyword popularity. Popularity here is determined from the patterns of search terms obtained through search log analysis, and on that basis the search service provider can present advertisers with a neutral and legitimate basis for billing. Search service providers also offer various primary and secondary services built on search logs. Popular search term services and related search term services, for example, use search logs to present the search terms currently attracting user interest and the search terms related to them. These services have succeeded because the vast search logs satisfied the premise that they are the product of Internet users' genuine intentions.

Recently, however, attempts to distort search logs so that they reflect the improper intentions of particular individuals or groups have been increasing, and their share is expected to keep growing. Such abuse of search logs contaminates the logs, undermines trust in the revenue models that depend on them, and degrades service quality.

The present invention is intended to solve the above problems. Its object is to provide a search log abuse prevention method and apparatus that detect abnormal behavior by tracking and analyzing search logs and that remove the contaminated portions.

To achieve the above object, a search log abuse prevention method according to one aspect of the present invention includes selecting, from a search log, targets to be inspected for abnormal behavior, and detecting abnormal behavior by scoring the degree to which each selected target deviates from normal. In one embodiment, the method may further include correcting the search log by removing the detected abnormal behavior from the search log using a predetermined deduction logic.

The inspection target selection step includes generating at least one of search term summary information, obtained by statistically analyzing the per-IP input counts of a specific search term within a predetermined time window of the search log, and IP summary information, obtained by statistically analyzing the input counts of each search term at a specific IP. In the abnormal behavior detection step, the abnormal behavior is detected from at least one of the search term summary information and the IP summary information. Here, IP means Internet Protocol.

The summary information generation step includes generating, from the search log, at least one of a per-IP input count vector for a specific search term and an input count vector of each search term at a specific IP, both taken within a predetermined time window. It further includes reducing the dimension of the per-IP input count vector of the specific search term to generate the search term summary information, or reducing the dimension of the input count vector of each search term at the specific IP to generate the IP summary information.

The dimension reduction step uses hash buckets (hashed buckets) to convert the per-IP input count vector of the specific search term and the input count vector of each search term at the specific IP into count vectors over a limited number of buckets.

The search term summary information and the IP summary information are modeled as multidimensional distributions using a statistical method.

The abnormal behavior detection step includes calculating, for at least one of the search term summary information and the IP summary information modeled as a multidimensional distribution, a score representing the degree of abnormality according to the distance from the center, and determining that abnormal behavior is contained in any search term summary information or IP summary information whose calculated score is equal to or greater than a predetermined reference value. Before the calculation step, the abnormal behavior detection step may further include compressing the data by reducing the dimension of at least one of the modeled search term summary information and IP summary information.

The calculation step computes the score for the degree of abnormality as a ratio to a predetermined reference value, using a statistic modeled as a sum of samples from mutually independent standard normal distributions in the reduced dimensions.

The correction step includes removing the abnormal behavior from at least one of the search term summary information and the IP summary information in which abnormal behavior was detected, using a deduction logic that applies information theory for measuring the difference between distributions.

A search log abuse prevention apparatus according to another aspect of the present invention includes a preprocessing unit that selects, from a search log, targets to be inspected for abnormal behavior; an abnormal behavior detection unit that detects abnormal behavior by scoring the degree to which each selected target deviates from normal; and an abnormal behavior correction unit that corrects the search log by removing the detected abnormal behavior from it using a predetermined deduction logic.

FIG. 1 is a schematic block diagram of a search log abuse prevention apparatus according to an embodiment of the present invention.
FIG. 2 is a flowchart illustrating a search log abuse prevention method according to an embodiment of the present invention.
FIG. 3 is a flowchart illustrating details of an inspection target selection process according to an embodiment of the present invention.
FIG. 4 is a flowchart illustrating details of an abnormal behavior detection process according to an embodiment of the present invention.
FIG. 5 is a reference diagram for explaining the statistical method used in the abnormal behavior detection process.
FIG. 6 is a flowchart illustrating details of a search log correction process according to an embodiment of the present invention.
FIG. 7 is a diagram illustrating a deduction logic according to an embodiment of the present invention used in the search log correction process.
FIG. 8 is a diagram illustrating a user interface screen according to an embodiment of the present invention.
FIGS. 9 to 13 are diagrams showing performance test results of the search log abuse prevention apparatus according to an embodiment of the present invention.

Preferred embodiments of the present invention are described in detail below with reference to the accompanying drawings. In describing the present invention, detailed descriptions of related known functions or configurations are omitted where they would obscure the gist of the invention. The terms used below are defined in view of their functions in the present invention and may vary with the intentions or customs of users and operators. Each term should therefore be defined on the basis of the content of this specification as a whole.

FIG. 1 is a schematic block diagram of a search log abuse prevention apparatus according to an embodiment of the present invention. As shown, the apparatus includes a preprocessing unit 10, an abnormal behavior detection unit 20, and an abnormal behavior correction unit 30.

The preprocessing unit 10 selects, from the search log, the targets to be inspected for abnormal behavior. Targets are selected by the preprocessing unit 10, rather than inspecting the entire search log, because the number of IPs from which search terms were entered, the number of search terms, and their combinations make the volume of search log data too large.

To this end, the preprocessing unit 10 first generates the IP and search term candidates that are of interest at the time of inspection, and generates the input values used in the inspection stage.

The preprocessing unit 10 generates, from the search log, a per-IP input count vector for a specific search term and/or an input count vector of each search term at a specific IP within a predetermined time window, and reduces the dimension of each generated input count vector to produce search term summary information and/or IP summary information.

In this way, the preprocessing unit 10 generates search term summary information, obtained by statistically analyzing the per-IP input counts of a specific search term, IP summary information, obtained by statistically analyzing the input counts of each search term at a specific IP, or a combination of the two. The generated search term summary information and/or IP summary information can be modeled as a multidimensional distribution using a statistical method.

As another embodiment of the present invention, the number of targets to be inspected for abnormal behavior can be reduced by applying the concept of the "real-time surging search term detection method and real-time surging search term detection system" described in Korean Registered Patent No. 522029, previously filed by the present applicant, so that only search terms and/or IPs that are attracting a certain degree of attention are selected as inspection targets.

The abnormal behavior detection unit 20 scores the degree to which each target selected by the preprocessing unit 10 deviates from normal, and detects abnormal behavior among the selected targets. That is, it introduces a scoring technique based on statistical methodology and carries out a score calculation procedure for abnormal behavior per IP and/or per search term.

The abnormal behavior detection unit 20 calculates, for the search term summary information and/or IP summary information modeled as a multidimensional distribution by a statistical method, a score representing the degree of abnormality according to the distance from the center, and determines that any search term summary information and/or IP summary information whose score is equal to or greater than a predetermined reference value contains abnormal behavior. To improve data processing efficiency, the data can be compressed before the score is calculated by reducing the dimension of the modeled search term summary information and/or IP summary information.

The abnormal behavior correction unit 30 corrects the search log by removing from it, using a predetermined deduction logic, the abnormal behavior detected by the abnormal behavior detection unit 20. In one embodiment, the abnormal behavior correction unit 30 uses a deduction logic that applies information theory for measuring the difference between distributions, and thereby removes the contaminated portion from the search term summary information and/or IP summary information in which abnormal behavior was detected. That is, the deduction logic is used to deduct the search counts attributable to abnormal behavior so that only normal behavior remains in the search log. This makes it possible to detect and repair search term abuse driven by improper intent and to keep the search log clean.
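One way to read "information theory for measuring the difference between distributions" is as a divergence between an observed distribution and a reference distribution of normal behavior. The sketch below computes the Kullback-Leibler divergence, which is an assumed choice made purely for illustration: this passage does not name the measure, and the function name `kl_divergence` and the sample distributions are hypothetical.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """Return D(p || q) in nats for two discrete distributions of equal length.

    Terms with p_i == 0 contribute nothing; q_i is floored at eps to avoid
    division by zero (an illustrative safeguard, not part of the source).
    """
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical bucket distributions: a uniform "normal" profile and a skewed one.
normal = [0.25, 0.25, 0.25, 0.25]
observed = [0.85, 0.05, 0.05, 0.05]

print(kl_divergence(normal, normal))    # identical distributions
print(kl_divergence(observed, normal))  # skewed distribution diverges from normal
```

A larger divergence indicates that the observed distribution departs further from the reference, which is the kind of quantity a deduction rule could be driven by.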

A search log abuse prevention method according to an embodiment of the present invention is described in detail below, based on the configuration of the search log abuse prevention apparatus described above.

FIG. 2 is a flowchart illustrating a search log abuse prevention method according to an embodiment of the present invention.

Referring to FIG. 2, to prevent abuse of the search log, targets to be inspected for abnormal behavior are first selected from the search log (S100). In one embodiment, the targets selected from the search log are search term summary information, obtained by statistically analyzing the per-IP input counts of a specific search term, and/or IP summary information, obtained by statistically analyzing the input counts of each search term at a specific IP. The search term summary information and IP summary information can be modeled as multidimensional distributions using a statistical method.

Next, the degree to which the selected search term summary information and/or IP summary information deviates from normal is scored, and abnormal behavior is detected (S200). In one embodiment, the method may further include correcting the search log by removing the detected abnormal behavior using a predetermined deduction logic (S300).

The process of selecting inspection targets is now described more specifically with reference to FIG. 3.

Referring to FIG. 3, to select the targets to be inspected for abnormal behavior, a per-IP input count vector for a specific search term and/or an input count vector of each search term at a specific IP is first generated from the search log within a predetermined time window (S110). The dimension of each generated input count vector is then reduced to produce search term summary information and/or IP summary information (S120).
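Step S110 can be sketched as a single pass over the log that counts entries inside the time window. This is a minimal illustration under stated assumptions: the `(timestamp, ip, term)` record layout, the function name `build_count_vectors`, and the sample data are all hypothetical, and the vectors are kept sparse as dictionaries.

```python
from collections import Counter, defaultdict

def build_count_vectors(log, window_start, window_end):
    """Return (per_ip, per_term) sparse count vectors for one time window.

    log is an iterable of (timestamp, ip, term) tuples; this record layout
    is an assumption made for illustration.
    """
    per_ip = defaultdict(Counter)    # ip -> {term: count}
    per_term = defaultdict(Counter)  # term -> {ip: count}
    for ts, ip, term in log:
        if window_start <= ts < window_end:
            per_ip[ip][term] += 1
            per_term[term][ip] += 1
    return per_ip, per_term

log = [
    (0, "10.0.0.1", "weather"),
    (1, "10.0.0.1", "weather"),
    (2, "10.0.0.2", "weather"),
    (3, "10.0.0.1", "news"),
    (99, "10.0.0.3", "news"),  # falls outside the window used below
]
per_ip, per_term = build_count_vectors(log, 0, 10)
print(dict(per_ip["10.0.0.1"]))
print(dict(per_term["weather"]))
```

Keeping the counts sparse avoids allocating one slot per known IP or search term, which is the memory concern the text goes on to address with hash buckets.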

The inspection target selection process described above is explained in more detail below using a specific example. The following is only one example of an inspection target selection method, and various modifications are of course possible.

1. First stage: preprocessing stage
To investigate search term abuse, IP summary information and search term summary information must be generated from the search log DB. A single IP enters many search terms during a given period. IP summary information is needed in order to measure how far the search pattern of that IP differs from the search patterns of other, ordinary IPs. Likewise, a single search term is entered from many different IPs, so summary information about the IPs that entered that search term must also be generated.

However, the number of IPs, the number of search terms, and their combinations are enormous, and processing all of them would run into memory limits. The IPs and search terms to be inspected therefore need to be selected.

1) Representation of the input vectors
To generate the IP summary information and the search term summary information, the following vector representation can first be introduced.

Let the total number of IPs be N_I and the total number of search terms be N_Q. The information about a specific IP within a predefined time window W at a given inspection time can then be expressed as a vector of the number of times each search term was entered from that IP:

[Equation 1]

Here, the k-th element of this vector,

[Equation 2]

is the number of times the k-th search term was entered from the specific IP.

Similarly, the information about a specific search term can be expressed as a vector of the number of times that search term was entered from each IP:

[Equation 3]

Here, the k-th element of this vector,

[Equation 4]

is the number of times the specific search term was entered from the k-th IP.
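The equation images for Equations 1 to 4 are not preserved in this text. A reconstruction consistent with the surrounding definitions might read as follows; the symbols $\mathbf{c}^{\mathrm{IP}}$ and $\mathbf{c}^{\mathrm{Q}}$ are assumed notation chosen for illustration, not the patent's own.

```latex
\begin{align}
\mathbf{c}^{\mathrm{IP}} &= \left(c^{\mathrm{IP}}_{1},\, c^{\mathrm{IP}}_{2},\, \dots,\, c^{\mathrm{IP}}_{N_Q}\right),
  & c^{\mathrm{IP}}_{k} &= \text{number of times the $k$-th search term was entered from the specific IP within } W, \\
\mathbf{c}^{\mathrm{Q}} &= \left(c^{\mathrm{Q}}_{1},\, c^{\mathrm{Q}}_{2},\, \dots,\, c^{\mathrm{Q}}_{N_I}\right),
  & c^{\mathrm{Q}}_{k} &= \text{number of times the specific search term was entered from the $k$-th IP within } W.
\end{align}
```

The first vector has one element per search term (length $N_Q$) and the second has one element per IP (length $N_I$), which is why storing them in full is impractical.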

However, because the total number of IPs N_I and the total number of search terms N_Q are enormous, maintaining these vector representations in full inevitably runs into memory limits.

2) Selecting the IPs and search terms to inspect using hash buckets
Within a specific time window W, the number of distinct search terms entered from a specific IP is only a tiny fraction of the total number of search terms N_Q. Likewise, the number of distinct IPs that entered a specific search term within W is only a tiny fraction of the total number of IPs N_I. The memory problem described above can be solved by exploiting this property when generating the summary information for a specific IP and for a specific search term. That is, hash buckets (hashed buckets) are used whose number of buckets is far smaller than the total number of search terms or the total number of IPs.

If the number of buckets D satisfies D << N_I, N_Q, the summary information of a specific IP can be expressed as a count vector over the hash buckets:

[Equation 5]

Here, the k-th element of this vector,

[Equation 6]

is the number of times the k-th bucket was hit for the specific IP. When the specific IP enters a search term q, the index k of the bucket associated with q is computed with a hash function:

[Equation 7]

The count of the bucket at the computed index k is then incremented.

Through this process, the information about a specific IP is summarized as a vector whose length equals the number of buckets D, as expressed in Equation 5 above, which yields the IP summary information. In the same way, the information about a search term can be summarized as a vector of length D to yield the search term summary information.

Summarizing the IP information and search term information as vectors of length D, where the number of buckets D is far smaller than the total number of IPs N_I and the total number of search terms N_Q, thus solves the memory problem.
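The hash bucket summary can be sketched as follows. This is an illustration under assumptions: `zlib.crc32` stands in for the unspecified hash function, and the names `bucket_index` and `summarize`, the bucket count of 8, and the sample counts are hypothetical.

```python
import zlib

def bucket_index(key: str, num_buckets: int) -> int:
    """Map a search term (or IP string) to a bucket index in [0, num_buckets).

    zlib.crc32 is an assumed stand-in for the hash function; it is
    deterministic across runs, unlike Python's built-in hash() for strings.
    """
    return zlib.crc32(key.encode("utf-8")) % num_buckets

def summarize(counts: dict, num_buckets: int) -> list:
    """Fold a sparse {key: count} vector into a fixed-length bucket count vector."""
    summary = [0] * num_buckets
    for key, n in counts.items():
        summary[bucket_index(key, num_buckets)] += n
    return summary

# Hypothetical per-IP counts: search term -> number of times entered in the window.
ip_counts = {"weather": 3, "news": 1, "stock price": 2}
ip_summary = summarize(ip_counts, num_buckets=8)
print(ip_summary)
```

The summary always has length D regardless of how many distinct search terms or IPs exist. Distinct keys can collide in one bucket, which is the price paid for the fixed memory footprint.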

The search term summary information and IP summary information generated through the above process can be modeled as multidimensional distributions using a statistical method.

A method of scoring the degree of abnormal behavior on the basis of the hash bucket vector representation of Equation 5 above is described below with reference to FIG. 4, which shows the flow of the abnormal behavior detection process in detail.

Referring to FIG. 4, to detect abnormal behavior, the data is first compressed by reducing the dimension of the search term summary information and/or IP summary information modeled as a multidimensional distribution by a statistical method (S210). In one embodiment, principal component analysis (hereinafter "PCA"), which maps the input data onto a mutually orthogonal coordinate system, can be used to compress the data.

Next, for the dimension-reduced search term summary information and/or IP summary information, a score representing the degree of abnormality is calculated according to the distance from the center (S220). In one embodiment, the score can be calculated as a ratio to a predetermined reference value, using a statistic modeled as a sum of samples from mutually independent standard normal distributions in the reduced dimensions.

Finally, any search term summary information and/or IP summary information whose calculated score is equal to or greater than a predetermined reference value is determined to contain abnormal behavior (S230). That is, summary information whose score is at or above the reference value is detected as abnormal behavior.

The abnormal behavior detection process described above is explained in more detail below using a specific example. The following is only one example of an abnormal behavior detection method, and various modifications are of course possible.

2. Second stage: abnormal behavior detection stage
As expressed in Equation 5 above, the IP summary information and the search term summary information can each be expressed as a vector whose elements are, respectively, the input counts of each search term at a specific IP and the per-IP input counts of a specific search term. Denoting this vector by

[Equation 8]

it follows a discrete probability distribution (discrete distribution) and can be expressed as

[Equation 9]

Here, p is a probability vector, calculated as follows.

[Equation 10]

Finally, in the present invention, the IP summary information and/or the search term summary information is expressed, using the probability vector p, as the following set of probability vectors.

[Equation 11]
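The formula for the probability vector p is not preserved in this text. A natural reading, assumed here purely for illustration, is that each bucket count is divided by the total count so that the elements sum to 1. The function name `to_probability_vector` and the sample counts are hypothetical.

```python
def to_probability_vector(bucket_counts):
    """Normalize a bucket count vector so its elements sum to 1.

    Normalizing by the total count is an assumption; the source formula
    for p is not available in this text.
    """
    total = sum(bucket_counts)
    if total == 0:
        raise ValueError("summary has no hits to normalize")
    return [c / total for c in bucket_counts]

p = to_probability_vector([3, 0, 1, 4])
print(p)  # [0.375, 0.0, 0.125, 0.5]
```

Working with proportions rather than raw counts makes summaries comparable between a heavily used IP or search term and a lightly used one.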

A method is proposed below for scoring the degree to which the search term summary information and/or IP summary information, expressed using the probability vector p as in Equation 11 above, deviates from normal behavior.

1) Data compression using principal component analysis
According to one embodiment of the present invention, a data compression step is performed to make data processing smoother. Specifically, the data is compressed by using PCA to reduce the D dimensions, where D is the number of buckets. That is, from the discrete probability distribution representing the IP summary information or the search term summary information,

[Equation 12]

this method finds the principal component vectors that maximize the variance of the mapped values, which amounts to finding the few eigenvectors that best explain the characteristics of that discrete probability distribution.

このようなＰＣＡ方法において、主成分ベクトルとしては、全体分散のうち該当の離散確率分布の分散をよく説明するｄ個の主成分ベクトル（ここで、ｄ＜Ｄである）のみを利用することが一般的である。この時、ｄ個の主成分ベクトルに写像された入力データにおいて、各成分ごとに互いに異なる分散で写像された値間の相関関係（correlation）は存在しないし、各主成分ベクトルは直交することとなる。ＰＣＡ方法は既に広く知られた公知の方法を使用するので、ＰＣＡについての具体的な説明は省略する。 In such a PCA method, as the principal component vector, only d principal component vectors (here, d <D) that sufficiently explain the variance of the corresponding discrete probability distribution among the total variances may be used. It is common. At this time, in the input data mapped to the d principal component vectors, there is no correlation between values mapped with different variances for each component, and each principal component vector is orthogonal. Become. Since the PCA method uses a well-known method that is already widely known, a specific description of PCA is omitted.

このようなＰＣＡ方法を用いてバケツの個数がＤ次元だったＩＰ要約情報または検索語要約情報を表す離散確率分布を、それよりはるかに少ない数のｄ次元にその次元を縮小することでデータを圧縮し、データ処理効率を上げることができる。 Using such a PCA method, the data is obtained by reducing the dimensionality of a discrete probability distribution representing IP summary information or search word summary information in which the number of buckets is D-dimensional to a much smaller number of d-dimensions. Data processing efficiency can be increased by compressing.
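As a minimal sketch of the dimension reduction described above, the top d principal components can be found from the sample covariance matrix. The disclosure does not specify how the eigenvectors are computed; the power iteration with deflation used here, the function name `pca_top_components`, and the sample data are all assumptions made for this illustration.

```python
import math


def pca_top_components(rows, d, iters=200):
    """Return (mean, [(variance, unit_vector), ...]) for the top d principal components."""
    n, D = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(D)]
    centered = [[r[j] - mean[j] for j in range(D)] for r in rows]
    # Sample covariance matrix (D x D).
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(D)]
           for a in range(D)]
    components = []
    for _ in range(d):
        # Deterministic, normalized start vector (an arbitrary choice).
        v = [1.0 / (i + 1) for i in range(D)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _ in range(iters):
            w = [sum(cov[i][j] * v[j] for j in range(D)) for i in range(D)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-15:
                break  # no variance left in the remaining directions
            v = [x / norm for x in w]
        # Rayleigh quotient: the variance explained along v.
        lam = sum(v[i] * cov[i][j] * v[j] for i in range(D) for j in range(D))
        components.append((lam, v))
        # Deflation: remove the found component before searching for the next one.
        for i in range(D):
            for j in range(D):
                cov[i][j] -= lam * v[i] * v[j]
    return mean, components


# Hypothetical probability vectors with D = 3 buckets, reduced to d = 1.
samples = [[0.7, 0.2, 0.1], [0.5, 0.3, 0.2], [0.3, 0.4, 0.3], [0.1, 0.5, 0.4]]
mean, components = pca_top_components(samples, 1)
```

In practice a library eigenvalue routine would normally replace the hand-written iteration; the sketch only shows which quantities (mean, component vectors, per-component variances) the later steps rely on.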

A method of scoring the degree of deviation from normal behavior using the d-dimensional input data obtained by principal component analysis is described in detail below.

2) Scoring method for measuring the degree of abnormality
The input data projected onto the d principal component vectors by the PCA method described above has a different variance along each component, which means that each dimension is scaled differently. In this case, a prewhitening method can be used, which scales the principal component vectors so that the variance along each dimension becomes 1; this helps visualization and post-processing.

Given a prewhitened mapping matrix $W$, let the image of an input vector $x$ under this mapping be the d-dimensional vector $y = (y_1, \dots, y_d)$. Then, for $i \neq j$, $y_i$ and $y_j$ are uncorrelated with each other, and the variance is $\mathrm{Var}(y_i) = 1$.
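The prewhitened projection can be sketched as follows, assuming the principal component vectors and their variances have already been obtained. Subtracting the sample mean before projecting is standard PCA practice and is an assumption here, as are the function name `prewhiten` and the example values.

```python
import math


def prewhiten(x, mean, components, variances):
    """Project x onto each principal component and rescale to unit variance."""
    centered = [xi - mi for xi, mi in zip(x, mean)]
    return [
        sum(c * v for c, v in zip(centered, component)) / math.sqrt(variance)
        for component, variance in zip(components, variances)
    ]


# Hypothetical 2-D example: axis-aligned components with variances 4 and 9.
y = prewhiten([3.0, 4.0], [1.0, 1.0], [[1.0, 0.0], [0.0, 1.0]], [4.0, 9.0])
```

Dividing each projected value by the standard deviation of its component is what makes every dimension of y comparable on the same scale.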

To score abnormal behavior according to the present invention, the following assumptions are made.
1) Each $y_i$ follows the standard normal distribution N(0, 1).
2) When i ≠ j, $y_i$ and $y_j$ are independent of each other.

In general, being uncorrelated does not imply being independent, but the present invention adopts this strong assumption to make data processing more efficient.

Under these assumptions, the following statistic can be defined.

$s = \sum_{i=1}^{d} y_i^2$

In statistics, the chi-square distribution with d degrees of freedom, $\chi^2_d$, is generally modeled as the sum of the squares of d mutually independent standard normal samples. Therefore, under the assumptions of Equation 21 above, the statistic $s$ can be regarded as following a chi-square distribution with d degrees of freedom.

In what follows, for a critical value α, the smallest value of s satisfying $F(s) \geq 1 - \alpha$ is defined as $s_\alpha$. Here $F(s)$ denotes the cumulative probability up to the boundary s, and α is the error level or significance level, usually preferably 0.05 or 0.01. Consequently, since $s_\alpha$ is the maximum boundary of the normal range that does not exceed the critical value α, every $s$ beyond the boundary $s_\alpha$ can be considered to fall in the abnormal range.

Accordingly, in the present invention, to score the degree of deviation from normal behavior, the abuse score is defined by the following equation (Equation 30).

$\mathrm{score} = \dfrac{s}{s_\alpha}$

That is, the further the score exceeds 1, the smaller the tail probability $1 - F(s)$ becomes, falling below the critical value α. This provides grounds for judging the observation to be extremely rare under the given assumptions. In other words, an abuse score defined by Equation 30 that is greater than 1 is a rare case outside the normal range and can be judged to be abnormal behavior.

The statistical method described above is explained concretely with reference to the example of FIG. 5, which shows an example of a chi-square distribution with one degree of freedom. Since $s_\alpha$ is the maximum boundary 902 of the normal range of the chi-square distribution when the critical value representing the error level or significance level is α, every $s$ beyond the boundary $s_\alpha$ can be considered to fall in the abnormal range.

That is, the region 904 obtained by subtracting the cumulative probability $F(s_\alpha)$ up to the normal boundary from the total probability 1 is the abnormal region, and every $s$ falling in this region 904 can be considered to be in the abnormal range.
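The scoring described above can be sketched in pure Python as follows. The disclosure does not say how the chi-square boundary is computed; the series expansion of the chi-square cumulative distribution and the bisection search are assumptions chosen to keep the example self-contained, and the names `chi2_cdf`, `normal_boundary` and `abuse_score` are hypothetical.

```python
import math


def chi2_cdf(s, d):
    """Cumulative probability F(s) of a chi-square distribution with d degrees of freedom."""
    if s <= 0:
        return 0.0
    a, x = d / 2.0, s / 2.0
    # Series for the regularized lower incomplete gamma function P(a, x).
    term, total, n = 1.0, 1.0, 1
    while term > 1e-16 * total and n < 100000:
        term *= x / (a + n)
        total += term
        n += 1
    log_prefactor = a * math.log(x) - x - math.lgamma(a + 1.0)
    return min(1.0, math.exp(log_prefactor) * total)


def normal_boundary(alpha, d):
    """Smallest s with F(s) >= 1 - alpha, located by bisection."""
    target = 1.0 - alpha
    lo, hi = 0.0, 1.0
    while chi2_cdf(hi, d) < target:
        hi *= 2.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if chi2_cdf(mid, d) >= target:
            hi = mid
        else:
            lo = mid
    return hi


def abuse_score(y, alpha=0.01):
    """Ratio of the statistic (sum of squared prewhitened values) to the normal boundary."""
    s = sum(v * v for v in y)
    return s / normal_boundary(alpha, len(y))
```

A score above 1 means the statistic lies beyond the normal boundary for the chosen significance level; with a statistics library available, its chi-square quantile function would normally replace `normal_boundary`.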

Next, the search log correction process according to an embodiment of the present invention is described in detail with reference to FIG. 6.

Referring to FIG. 6, to correct the search log, the contaminated portion is removed from the search word summary information and/or IP summary information in which abnormal behavior was detected, using deduction logic that applies information theory to measure the difference between distributions (S310).

The deduction logic (discounting logic) for correcting the search log can remove the abnormal behavior using the KL distance (Kullback-Leibler distance), which represents the difference in distribution between the probability model of the population and the probability model of the search word summary information and/or IP summary information in which the abnormal behavior was detected.

The search log correction process is now explained in detail through a specific embodiment. The following embodiment is merely one example of the search log correction method, and various modifications are of course possible.

3. Third stage: search log correction stage
1) Means for measuring the difference between distributions: KL distance
As described above, the deduction logic used for search log correction according to an embodiment of the present invention uses the KL distance as the means of measuring the difference in distribution between the probability model of the search word summary information and/or IP summary information in which abnormal behavior was detected and the probability model of the population.

The KL distance is grounded in information theory (Cover and Thomas (1991)). For example, given two distributions p and q, the KL distance between them can be obtained as follows.

$KL(p \,\|\, q) = \sum_{i} p_i \log \dfrac{p_i}{q_i}$

Therefore, the KL distance is 0 when the two distributions are identical.
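As a small illustration, the KL distance between two discrete distributions given as equal-length lists can be computed as below. The function name `kl_distance` is hypothetical, and the convention that terms with a zero probability in the first distribution contribute nothing is the usual one, stated here as an assumption.

```python
import math


def kl_distance(p, q):
    """Kullback-Leibler distance between discrete distributions p and q.

    Terms with p_i == 0 contribute 0; q_i must be positive wherever p_i > 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


same = kl_distance([0.5, 0.5], [0.5, 0.5])
skewed = kl_distance([1.0, 0.0], [0.5, 0.5])
```

Note that the measure is not symmetric, so the order of the two arguments matters.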

2) Deduction logic
For convenience, the N data items used to build the model are taken as the population and represented by an N × D matrix M, whose I-th row, $m_I$, is a vector holding the hash bucket counts. Normalizing M row-wise yields the discrete probability model m.

Let h be the hash bucket vector of an abnormal pattern and p its discrete probability model. The KL distance between the discrete probability distribution p under inspection and the population's discrete probability distribution m is then calculated as follows (Equation 40).

$KL(p \,\|\, m) = \sum_{i=1}^{D} p_i \log \dfrac{p_i}{m_i}$

From Equation 40, reducing the value of particular elements of the hash bucket vector h reduces the difference between the modified discrete probability model and that of the population. Specifically, the larger the positive value of $p_i \log (p_i / m_i)$ for a hash bucket i, the larger the KL distance between the two distributions, which makes the distribution p abnormal. Therefore, to keep the search log clean by removing abnormal behavior, given a critical value β, the hash buckets with $p_i \log (p_i / m_i) > \beta$ become the correction candidates to which the deduction logic is applied.
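The selection of correction candidates can be sketched as follows. The per-bucket term p_i log(p_i / m_i) compared against β is a reading of the text above rather than a formula quoted from the original equation images, and the names `kl_terms` and `find_candidates` are hypothetical.

```python
import math


def kl_terms(p, m):
    """Per-bucket contributions p_i * log(p_i / m_i) to the KL distance between p and m."""
    return [pi * math.log(pi / mi) if pi > 0 else 0.0 for pi, mi in zip(p, m)]


def find_candidates(p, m, beta):
    """Indices of the hash buckets whose contribution exceeds the critical value beta."""
    return [i for i, f in enumerate(kl_terms(p, m)) if f > beta]


# Hypothetical example: four buckets, uniform population model, one dominant bucket.
candidates = find_candidates([0.7, 0.1, 0.1, 0.1], [0.25] * 4, math.log(1.8))
```

Only buckets that are over-represented relative to the population produce a large positive term, so under-represented buckets are never selected.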

FIG. 7 is a diagram illustrating the deduction logic used in the search log correction process according to an embodiment of the present invention.

FIG. 7 shows the overall deduction logic. Here, the "find()" function returns the indices of the elements that satisfy the condition in the parentheses. The "ceil()" function returns the smallest integer not less than its argument. The operator ".*" performs element-wise multiplication of vectors. "score" is the abuse score defined in Equation 30 above. h denotes the search word input counts, p the probability function obtained by normalizing h, β the critical value for selecting correction candidates, and f the KL distance between the population's discrete probability distribution m and the discrete probability distribution p under inspection.

The overall deduction logic proceeds as follows. First, the per-IP input counts of a specific search word, or the input counts of each search word at a specific IP, are normalized to obtain the probability function, and the KL distance based on the difference from the population's discrete probability model is calculated (904). The indices i for which the KL distance exceeds the critical value β are then found; these indices indicate the search words or IPs that contain abnormal behavior. The search counts at those indices are reduced (906), and the critical value β is adjusted.

The deduction logic above is repeated until score < 1, that is, until the sample falls within the normal range, or until no candidate at or above the critical value β remains. β is increased at each iteration because the core abnormal behavior is already deducted in the early iterations, so later iterations apply a stricter deduction criterion.
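The iteration described above might be sketched as follows. FIG. 7 itself is not reproduced in the text, so several parts are assumptions made only for illustration: the rule that shrinks a candidate bucket to the count expected under the population model, the `max_iter` safety limit, and `toy_score`, which is a hypothetical stand-in for the abuse score of Equation 30.

```python
import math


def toy_score(h, m):
    """HYPOTHETICAL stand-in for the abuse score: half the largest ratio p_i / m_i."""
    total = sum(h)
    return max((c / total) / mi for c, mi in zip(h, m)) / 2.0


def deduct(h, m, score_fn, beta, scale, max_iter=100):
    """Reduce over-represented bucket counts until the score is normal or no candidate remains."""
    h = list(h)
    for _ in range(max_iter):
        if score_fn(h, m) < 1.0:
            break  # already within the normal range
        total = sum(h)
        p = [c / total for c in h]
        f = [pi * math.log(pi / mi) if pi > 0 else 0.0 for pi, mi in zip(p, m)]
        candidates = [i for i, fi in enumerate(f) if fi > beta]
        if not candidates:
            break  # nothing left above the critical value
        for i in candidates:
            # ASSUMPTION: shrink the bucket to the count expected from the population share.
            h[i] = min(h[i], math.ceil(total * m[i]))
        beta += scale  # stricter criterion for the next iteration
    return h


original = [70, 10, 10, 10]
corrected = deduct(original, [0.25] * 4, toy_score, math.log(1.8), math.log(1.3))
deducted = [a - b for a, b in zip(original, corrected)]
```

The difference between the original and corrected counts corresponds to the number of inputs treated as contaminated and removed from the log.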

FIG. 8 shows a user interface screen provided by the search log abuse prevention apparatus according to an embodiment of the present invention.

Referring to FIG. 8, the left pane displays the list of search words and the list of IPs selected as inspection targets, and the middle pane displays the number of counts to be deducted according to the abuse score computed for the degree of abnormality.

FIGS. 9 to 13 show experimental results for the search log abuse prevention method according to an embodiment of the present invention.

To confirm the performance of the search log abuse prevention method according to an embodiment of the present invention, results from around 12:30 on July 7, 2006 are examined. In this experiment, the time window W was set to 1 hour and the number of hash buckets D to 32. The critical values were set to α = 0.01 and β = log(1.8), with scale = log(1.3).

A model was built from the search word summary information, and the abuse inspection described above was applied to the set of inspection candidates; the 20 search words with the highest calculated abuse scores are shown in FIG. 9. Each panel shows the discrete probability model of one sample as a histogram. The vertical axis is the probability, with the axis scale fixed to [0, 1], and the horizontal axis is the hash bucket index. The search word and its abuse score are printed at the top of each panel. The top 20 search words all have abuse scores of roughly 3 to 9, all at least 1, so they are predicted to contain abnormal behavior.

FIG. 10 shows an example of the deduction results for the top 20 abusive search words detected by the present invention. Each row pairs the original hash buckets before deduction with the hash buckets after the deduction logic has been applied. After deduction the abuse score falls below 1, confirming that the abnormal behavior has been removed.

FIG. 11 shows a comparison of the deduction results with the discrete probability distribution values. The abuse scores of search words that had been abnormal, at roughly 3 to 9, fall to 1 or below, confirming correction to the normal range. For example, for the search word "タイプ" ("type"), the abuse score of 9.673833 becomes about 0.211166 after the deduction procedure that removes the abnormal behavior, showing that it was corrected to within the normal range. For convenience, the vertical axis after deduction is scaled to [0, 0.1].

The population probability model used as the reference for calculating the KL distance in the deduction logic is shown in FIG. 12.

Comparing the scores before and after deduction shows that, once the abnormal behavior is removed by the deduction logic, the abuse scores of the search words that had been abnormal return to a normal level.

FIG. 13 shows an example of the deduction results for the top 40 search words. The abuse score is listed on the left, followed by the total number of searches before deduction and the number of deductions calculated by the deduction logic. Subtracting the number of deductions from the total number of searches removes the contaminated portion from the search words detected as abnormal, so that the search log can be corrected.

The above describes a method of deducting the search counts of search words judged abnormal using the search word summary information, so that only information on normal behavior remains in the search log. Most search word abuse problems can be adequately resolved by the abuse detection and remediation method using the search word summary information alone. In rare cases, however, the abuse score from the search word summary information is below 1 and the behavior is judged normal even though it is in fact abnormal behavior caused by search word abuse. In such cases, the abuse can be further corrected using the IP summary information. Since that method is substantially the same as the abuse detection and remediation method for the search word summary information, a detailed description is omitted.

The present invention proposes a search log abuse prevention method and apparatus for maintaining a clean search log through diagnosis and post-processing of search abuse in the search log. Specifically, a hash-bucket-based data structure was built to represent the IP summary information and/or search word summary information, and it was converted into a discrete probability model to represent the input data.
A technique for detecting abnormal samples relative to normal samples was also proposed. The input data is moved by the PCA method into a space of mutually orthogonal principal component vectors, where a statistics-based scoring technique measures the distance from the center.

Finally, a deduction technique based on information theory was proposed to convert abnormal samples into normal samples.

The method for preventing abuse of search logs described above can be written as a computer program. The code and code segments constituting the program can be readily inferred by computer programmers in the field. The program is stored in computer readable media and implements the search log abuse prevention method when read and executed by a computer. The information storage media include magnetic recording media, optical recording media, and carrier wave media.

The present invention has been described above with reference to its preferred embodiments. Those of ordinary skill in the art to which the present invention pertains will understand that the invention can be embodied in modified forms without departing from its essential characteristics. The disclosed embodiments should therefore be considered in an illustrative rather than a limiting sense. The scope of the present invention is set out in the claims rather than in the foregoing description, and all differences within the equivalent scope should be construed as being included in the present invention.

As described above, according to the present invention, abusive search words containing abnormal behavior are efficiently detected from the search log, and of the search counts of search words judged to be abused, only the valid search counts are retained and the abnormal behavior is removed, which prevents abuse of the search log and keeps the search log clean.

Claims

A method for preventing misuse of search logs, comprising:
selecting, from a search log, a target to be inspected for abnormal behavior; and
scoring the degree of deviation from normal for the selected target to detect abnormal behavior.

The search log misuse prevention method according to claim 1, further comprising correcting the search log by removing the detected abnormal behavior from the search log using predetermined deduction logic.

The search log misuse prevention method according to claim 2, wherein the inspection target selection step includes generating at least one of search word summary information, obtained by statistically analyzing, per IP, the number of times each specific search word included in a predetermined time window of the search log was input, and IP summary information, obtained by statistically analyzing the number of times each search word was input at a specific IP, and
wherein, in the abnormal behavior detection step, the abnormal behavior is detected from at least one of the search word summary information and the IP summary information.

The search log misuse prevention method according to claim 3, wherein the summary information generation step includes:
generating, from the search log, at least one of a per-IP input count vector of a specific search word and an input count vector of each search word at a specific IP within a predetermined time window; and
reducing the dimension of the per-IP input count vector of the specific search word to generate the search word summary information, or reducing the dimension of the input count vector of each search word at the specific IP to generate the IP summary information.

The search log misuse prevention method according to claim 4, wherein the dimension reduction step for the input count vector includes converting the per-IP input count vector of each specific search word and the input count vector of each search word at the specific IP into count vectors over a limited number of buckets using hashed buckets.

The method of claim 3, wherein the search word summary information and the IP summary information are modeled into a multidimensional distribution using a statistical method.

The search log misuse prevention method according to claim 6, wherein the abnormal behavior detection step includes:
calculating the degree of abnormality as a score according to the distance from the center, for at least one of the search word summary information and the IP summary information modeled as the multidimensional distribution; and
determining that abnormal behavior is included in at least one of the search word summary information and the IP summary information whose calculated score is equal to or greater than a predetermined reference value.

The search log misuse prevention method according to claim 7, wherein the abnormal behavior detection step further includes, before the calculation step, compressing data by reducing the dimension of at least one of the modeled search word summary information and IP summary information.

The search log misuse prevention method according to claim 8, wherein the data compression step is performed using a principal component analysis method that maps input data onto mutually orthogonal coordinate axes.

The search log abuse prevention method, wherein the calculation step calculates the score for the degree of abnormality as a ratio to a predetermined reference value, using a statistic modeled through a sum of samples of mutually independent standard normal distributions in the reduced dimension.

The search log abuse prevention method according to claim 10, wherein the score for the degree of abnormality is expressed by the equation

$\mathrm{score} = \dfrac{s}{s_\alpha}$

where the statistic $s$ follows a chi-square distribution with d degrees of freedom modeled through the sum of samples of mutually independent standard normal distributions, and $s_\alpha$ represents the maximum boundary of the normal range not exceeding a critical value α.

The search log abuse prevention method according to claim 11, wherein every $s$ beyond said $s_\alpha$ is determined to be included in the abnormal range.

The search log misuse prevention method according to claim 6, wherein the deduction logic removes the abnormal behavior using a KL distance (Kullback-Leibler distance) representing the difference in distribution between the probability model of the population and the probability model of at least one of the search word summary information and the IP summary information in which the abnormal behavior was detected.

The search log abuse prevention method according to claim 3, wherein the search log correction step removes the abnormal behavior from at least one of the search word summary information and the IP summary information in which the abnormal behavior was detected, using deduction logic that applies information theory to measure the difference between distributions.

The search log abuse prevention method according to claim 1, wherein, in the inspection target selection step, a search word and/or an IP is selected as the target to be inspected for the abnormal behavior, and, in the abnormal behavior detection step, abnormal behavior is detected from the selected search word and/or IP.

A computer-readable recording medium on which a program for causing a computer to execute the search log misuse prevention method according to any one of claims 1 to 15 is recorded.

A search log misuse prevention device, comprising:
a preprocessing unit that selects, from a search log, a target to be inspected for abnormal behavior;
an abnormal behavior detection unit that detects abnormal behavior by scoring the degree of deviation from normal for the selected target; and
an abnormal behavior correction unit that corrects the search log by removing the detected abnormal behavior from the search log using predetermined deduction logic.

The search log abuse prevention device according to claim 17, wherein the preprocessing unit, in order to select the target to be inspected for the abnormal behavior, generates search word summary information obtained by statistically analyzing, per IP, the number of times each specific search word included in a predetermined time window of the search log was input, and/or IP summary information obtained by statistically analyzing the number of times each search word was input at a specific IP, and
wherein the abnormal behavior detection unit detects the abnormal behavior from the search word summary information and/or the IP summary information.

The search log abuse prevention apparatus according to claim 18, wherein the search word summary information and / or IP summary information is modeled into a multidimensional distribution using a statistical method.

The search log misuse prevention device according to claim 19, wherein the preprocessing unit reduces the dimension of the per-IP input count vector of the specific search word to generate the search word summary information, and reduces the dimension of the input count vector of each search word at the specific IP to generate the IP summary information.

The search log abuse prevention device according to claim 19, wherein the abnormal behavior detection unit calculates the degree of abnormality as a score according to the distance from the center for the modeled search word summary information and/or IP summary information, and determines that abnormal behavior is included in the search word summary information and/or IP summary information whose calculated score is equal to or greater than a predetermined reference value.

The search log misuse prevention device according to claim 21, wherein the abnormal behavior detection unit compresses data by reducing the dimension of the modeled search word summary information and/or IP summary information before the score is calculated.

The search log abuse prevention device, wherein the abnormal behavior correction unit removes the abnormal behavior from the search word summary information and/or the IP summary information in which the abnormal behavior was detected, using deduction logic that applies information theory to measure the difference between distributions.