JP2019016115A

JP2019016115A - Identifying device, identifying method, identifying program, model creating device, model creating method and model creating program

Info

Publication number: JP2019016115A
Application number: JP2017132269A
Authority: JP
Inventors: Hung Tao Tran; Akira Yamada; Kosuke Murakami; Jumpei Urakawa; Yukiko Sawatani; Ayumi Kubota
Original assignee: KDDI Corp
Current assignee: KDDI Corp
Priority date: 2017-07-05
Filing date: 2017-07-05
Publication date: 2019-01-31
Anticipated expiration: 2037-07-05
Also published as: JP6869833B2

Abstract

To provide devices, methods, and programs capable of efficiently identifying the type of a domain related to a DBD attack, and devices, methods, and programs for generating a model for that identification. SOLUTION: A model generation device 1 includes: an acquisition unit 101 that acquires registration information for a plurality of labeled domains, the labels including a landing domain and a distribution domain in a DBD attack; a calculation unit 102 that extracts words contained in the registration information and calculates an index related to the appearance frequency of each word; and a learning unit 106 that generates an identification model by supervised learning based on the labels, using the words and the index for each word as a first feature quantity. SELECTED DRAWING: Figure 2

Description

The present invention relates to a device for identifying the type of a domain and to a device for generating an identification model.

The Internet contains malicious sites created by attackers, and web security has become a very important issue for users. One example is the DBD (drive-by download) attack, which leads the user to a page that installs malware on the user's terminal.

In a DBD attack, code embedded in a seemingly legitimate landing page redirects the viewed page through multiple hop-point pages to a distribution page, and code on the distribution page installs malware on the user's terminal (see, for example, Non-Patent Document 1).

Detecting the pages involved in these attacks raises the level of web security. Furthermore, changing the restriction level according to the type of page, for example allowing access to a landing page but prohibiting page transitions, or prohibiting access to a distribution page, makes it possible to offer users appropriate web browsing while maintaining security. For example, Non-Patent Documents 2 to 5 disclose page classification techniques.
Non-Patent Documents 6 and 7 disclose techniques that use Whois, which is domain registration information, to classify whether a domain is malicious.

[Non-Patent Document 1] N. Provos, P. Mavrommatis, M. A. Rajab, and F. Monrose, "All Your iFRAMEs Point to Us". In: 17th Conference on Security Symposium (SS '08), pp. 1-15, 2008.
[Non-Patent Document 2] J. W. Stokes, R. Andersen, C. Seifert, and K. Chellapilla, "WebCop: locating neighborhoods of malware on the web". In: 3rd USENIX Conference on Large-scale Exploits and Emergent Threats: botnets, spyware, worms, and more (LEET '10), pp. 5-13, 2010.
[Non-Patent Document 3] G. Wang, J. W. Stokes, C. Herley, and D. Felstead, "Detecting Malicious Landing Pages in Malware Distribution Networks". In: 43rd Conference on Dependable Systems and Networks (DSN '13), pp. 1-11, 2013.
[Non-Patent Document 4] T. Nelms, R. Perdisci, M. Antonakakis, and M. Ahamad, "WebWitness: Investigating, Categorizing, and Mitigating Malware Download Paths". In: 24th USENIX Security Symposium (USENIX '15), pp. 1025-1040, 2015.
[Non-Patent Document 5] Google Safe Browsing version 4 (GSBv4), Internet <https://developers.google.com/safe-browsing/v4/>
[Non-Patent Document 6] L. Bilge, E. Kirda, C. Kruegel, and M. Balduzzi, "EXPOSURE: Finding Malicious Domains Using Passive DNS Analysis", Network and Distributed System Security Symposium, NDSS 2011.
[Non-Patent Document 7] M. Kuyama, Y. Kakizaki, and R. Sasaki, "Method for Detecting a Malicious Domain by using WHOIS and DNS features", 3rd Int. Conf. on Digital Security and Forensics (DigitalSec2016)

With the technique of Non-Patent Document 2, when a distribution page is detected, some landing pages are also detected if reference information exists.
The techniques of Non-Patent Documents 3 and 4 classify only whether a page is a landing page, or only whether it is a distribution page, so they cannot identify multiple types at once.
The blacklist used in Non-Patent Document 5 labels landing pages and distribution pages, but pages not registered in the list cannot be identified.

The techniques of Non-Patent Documents 6 and 7 cannot perform a detailed classification such as landing or distribution, but they classify whether a domain is malicious by extracting the attributes described in Whois. However, the Whois description format is not standardized, so extracting attributes is not easy.

Thus, with conventional techniques, it has been difficult to efficiently identify both landing domains and distribution domains among the problematic domains involved in DBD attacks unless those domains are already registered, in classified form, in a given blacklist.

An object of the present invention is to provide a device, a method, and a program capable of efficiently identifying the type of a domain related to a DBD attack, and a device, a method, and a program for generating a model for that identification.

A model generation device according to the present invention includes: an acquisition unit that acquires, for a plurality of domains assigned labels including a landing domain and a distribution domain in a DBD attack, registration information of those domains; a calculation unit that extracts words contained in the registration information and calculates an index related to the appearance frequency of each word; and a learning unit that generates an identification model by supervised learning based on the labels, using the words and the index for each word as a first feature quantity.

The model generation device may include a date extraction unit that extracts a registration date and an update date from the registration information, and the learning unit may generate the identification model using the number of days elapsed since the registration date and the number of days elapsed since the update date as a second feature quantity.

The model generation device may include a tag count unit that counts a first number of times that tags of specific types appear in a page document of the domain, and the learning unit may generate the identification model using the first number of times as a third feature quantity.

The model generation device may include an extension count unit that counts a second number of times that file extensions of specific types appear in a page document of the domain, and the learning unit may generate the identification model using the second number of times as a fourth feature quantity.

The labels may further include a hop-point domain.

In a model generation method according to the present invention, a computer executes: an acquisition step of acquiring, for a plurality of domains assigned labels including a landing domain and a distribution domain in a DBD attack, registration information of those domains; a calculation step of extracting words contained in the registration information and calculating an index related to the appearance frequency of each word; and a learning step of generating an identification model by supervised learning based on the labels, using the words and the index for each word as a first feature quantity.

A model generation program according to the present invention causes a computer to execute: an acquisition step of acquiring, for a plurality of domains assigned labels including a landing domain and a distribution domain in a DBD attack, registration information of those domains; a calculation step of extracting words contained in the registration information and calculating an index related to the appearance frequency of each word; and a learning step of generating an identification model by supervised learning based on the labels, using the words and the index for each word as a first feature quantity.

An identification device according to the present invention includes: an acquisition unit that acquires registration information of a specified domain; a calculation unit that extracts words contained in the registration information and calculates an index related to the appearance frequency of each word; and an identification unit that identifies a landing domain and a distribution domain in a DBD attack, using the words and the index for each word as a first feature quantity.

In an identification method according to the present invention, a computer executes: an acquisition step of acquiring registration information of a specified domain; a calculation step of extracting words contained in the registration information and calculating an index related to the appearance frequency of each word; and an identification step of identifying a landing domain and a distribution domain in a DBD attack, using the words and the index for each word as a first feature quantity.

An identification program according to the present invention causes a computer to execute: an acquisition step of acquiring registration information of a specified domain; a calculation step of extracting words contained in the registration information and calculating an index related to the appearance frequency of each word; and an identification step of identifying a landing domain and a distribution domain in a DBD attack, using the words and the index for each word as a first feature quantity.

According to the present invention, domains related to DBD attacks can be identified efficiently.

FIG. 1 is a conceptual diagram showing the types of domains related to a DBD attack according to the embodiment.
FIG. 2 is a block diagram showing the functional configuration of the model generation device according to the embodiment.
FIG. 3 is a block diagram showing the functional configuration of the identification device according to the embodiment.
FIG. 4 is a diagram showing the feature quantities that are input to the identification model according to the embodiment.

An example of an embodiment of the present invention is described below.
FIG. 1 is a conceptual diagram showing the types of domains related to a DBD attack according to the present embodiment.
When a user accesses a landing page belonging to a landing domain, code embedded in that page redirects the user to a hop-point page.
The hop-point page then sends the user, by redirection through a plurality of other hop-point pages, to a distribution page.
Finally, code embedded in the distribution page installs malware on the user's terminal.

The model generation device 1 according to the present embodiment generates, by learning, a model for identifying landing domains, hop-point domains, distribution domains, and other, normal domains. The identification device 2 identifies unclassified domains using the model generated by the model generation device 1.

FIG. 2 is a block diagram showing the functional configuration of the model generation device 1 according to the present embodiment.
The model generation device 1 is an information processing device (computer) that includes a control unit 10 and a storage unit 11 as well as input/output and communication interfaces. Each function of the present embodiment is realized by the control unit 10 reading and executing software (a model generation program) stored in the storage unit 11.

The control unit 10 of the model generation device 1 includes an acquisition unit 101, a calculation unit 102, a date extraction unit 103, a tag count unit 104, an extension count unit 105, and a learning unit 106.

The acquisition unit 101 acquires Whois, which is the registration information of a domain, for a plurality of domains assigned labels including a landing domain and a distribution domain in a DBD attack. The labels may further include a hop-point domain.

The labeled domain information that serves as training data may be obtained from existing blacklists, whitelists, and the like, or by manually classifying arbitrary domains.

The calculation unit 102 extracts the words contained in Whois by text analysis and calculates an index related to the appearance frequency of these words.
The index related to appearance frequency is, for example, TF-IDF. Characteristic words that appear frequently for a particular domain are adopted, together with their TF-IDF values, as the first feature quantity for identification.

The processing of the calculation unit 102 includes, for example, the following steps.
・Extract the words contained in Whois.
・Exclude unnecessary kinds of words.
・Build a dictionary of words.
・For each word, count its occurrences within a document and the number of documents.
・Calculate a TF-IDF value for each word.
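
For illustration only, the steps listed above could be sketched in Python as follows. This is not the implementation of the embodiment: the stop-word set, the tokenization rule, and the particular TF-IDF formula (term frequency normalized by document length, IDF as log(N / df)) are assumptions, since the description names TF-IDF only as an example index and does not fix a formula.

```python
import math
import re
from collections import Counter

# Hypothetical stop list; the description only says that
# "unnecessary kinds of words" are excluded.
STOP_WORDS = {"the", "of", "and", "for", "to", "in"}


def tokenize(whois_text):
    """Extract lowercase alphabetic words and drop the excluded ones."""
    words = re.findall(r"[a-z]+", whois_text.lower())
    return [w for w in words if w not in STOP_WORDS]


def tf_idf(documents):
    """Return one {word: TF-IDF value} dict per Whois document.

    TF is the word count divided by the document length; IDF is
    log(N / df), where df is the number of documents containing the word.
    """
    tokenized = [tokenize(doc) for doc in documents]
    # Dictionary of all words seen in any document.
    vocabulary = sorted({w for doc in tokenized for w in doc})
    # Number of documents in which each word appears.
    doc_freq = Counter(w for doc in tokenized for w in set(doc))
    n_docs = len(documents)
    scores = []
    for doc in tokenized:
        counts = Counter(doc)
        total = len(doc) or 1  # avoid division by zero for empty records
        scores.append({
            w: (counts[w] / total) * math.log(n_docs / doc_freq[w])
            for w in vocabulary if counts[w]
        })
    return scores
```

With this formula a word that occurs in every document receives a value of 0, so only words that distinguish some records from others carry weight.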

The date extraction unit 103 extracts the registration date and the update date from Whois. The numbers of days elapsed from these dates to the present are adopted as the second feature quantity for identification.
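
A minimal sketch of this date feature is given below. The field names `Creation Date` and `Updated Date` and the ISO date format are assumptions made for the example; actual Whois records differ between registries, and the description does not specify how the dates are located.

```python
import re
from datetime import date

# Assumed field names and ISO (YYYY-MM-DD) dates; real Whois records vary.
DATE_FIELDS = {
    "registered": r"Creation Date:\s*(\d{4})-(\d{2})-(\d{2})",
    "updated": r"Updated Date:\s*(\d{4})-(\d{2})-(\d{2})",
}


def extract_date(whois_text, field):
    """Return the date stored in the given field, or None if it is absent."""
    match = re.search(DATE_FIELDS[field], whois_text)
    if match is None:
        return None
    year, month, day = (int(group) for group in match.groups())
    return date(year, month, day)


def second_feature(whois_text, today):
    """Return (days since registration, days since update).

    An entry is None when the corresponding date is missing.
    """
    result = []
    for field in ("registered", "updated"):
        found = extract_date(whois_text, field)
        result.append(None if found is None else (today - found).days)
    return tuple(result)
```

Passing `today` explicitly, rather than reading the system clock inside the function, keeps the feature reproducible when the same record is processed for training and for identification.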

Whois records for landing domains, hop-point domains, and distribution domains often show characteristics such as the following, so the second feature quantity contributes to identification performance.
・The Whois update date of a landing domain is older than that of a distribution domain.
・The Whois registration date of a landing domain is older than usual.
・The Whois registration date of a distribution domain is newer than usual.
・The Whois registration date and update date of a hop-point domain are older than those of a landing domain.

The tag count unit 104 counts a first number of times that tags of specific types appear in the page documents contained in the domain.
The specific types of tags are, for example, the eight types <form>, <iframe>, <href>, <link>, <script>, <frame>, <object>, and <embed>, and the total number of their occurrences is adopted as the third feature quantity for identification.
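
As a rough sketch, the tag count could be computed with Python's standard `html.parser` module as shown below. The class and function names are hypothetical. The sketch counts start tags whose name is one of the eight listed; `href` ordinarily appears as an attribute rather than a tag, and how the embodiment counts it is not stated, so here it is matched only as a tag name.

```python
from html.parser import HTMLParser

# The eight tag names listed in the description.
TARGET_TAGS = {"form", "iframe", "href", "link",
               "script", "frame", "object", "embed"}


class TagCounter(HTMLParser):
    """Counts start tags whose name is in TARGET_TAGS."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names already converted to lowercase.
        if tag in TARGET_TAGS:
            self.count += 1


def count_tags(html_text):
    """Return the total number of target tags in one page document."""
    parser = TagCounter()
    parser.feed(html_text)
    parser.close()
    return parser.count
```

For a domain with several page documents, the per-page results would simply be summed.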

The extension count unit 105 counts a second number of times that file extensions of specific types appear in the page documents contained in the domain.
The specific types of extensions are, for example, the three types jar, swf, and pdf, and the total number of their occurrences is adopted as the fourth feature quantity for identification.
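
One possible way to count these extensions is a regular-expression scan of the page text, sketched below. Matching on a literal dot followed by the extension and a word boundary is an assumption; the description does not say whether extensions are taken from URLs, attribute values, or the whole document.

```python
import re

# The three extensions named in the description, matched case-insensitively.
EXTENSION_PATTERN = re.compile(r"\.(jar|swf|pdf)\b", re.IGNORECASE)


def count_extensions(html_text):
    """Return how many times .jar, .swf, or .pdf appears in the page text."""
    return len(EXTENSION_PATTERN.findall(html_text))
```

The word boundary prevents longer extensions that merely start with one of the three names from being counted.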

The learning unit 106 generates an identification model by supervised learning based on the labels, using the words and their TF-IDF values as the first feature quantity.
The learning unit 106 may additionally take as input for learning the number of days elapsed since the Whois registration date and since the update date as the second feature quantity, the first number of times obtained by counting tags as the third feature quantity, and the second number of times obtained by counting extensions as the fourth feature quantity.

Any of various learning algorithms may be used as appropriate, for example a decision tree, a support vector machine, naive Bayes, a neural network, stochastic gradient descent, the k-nearest neighbor method, or a random forest.
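
To make the supervised learning step concrete, the sketch below uses the k-nearest neighbor method, one of the algorithms listed above, in plain Python. The training vectors are made-up values (days since registration, days since update, tag count, extension count) and the word features are left out for brevity; none of these numbers come from the embodiment. Feature scaling is also omitted, although in practice the day counts would dominate the distance.

```python
import math
from collections import Counter

# Made-up labeled examples for illustration only:
# (days since registration, days since update, tag count, extension count)
TRAINING = [
    ((3000, 400, 12, 0), "landing"),
    ((2800, 350, 15, 0), "landing"),
    ((3100, 500, 10, 1), "landing"),
    ((20, 5, 3, 2), "distribution"),
    ((35, 10, 2, 3), "distribution"),
    ((15, 3, 4, 1), "distribution"),
]


def train(samples):
    """k-NN has no fitting step: the model is the labeled sample set."""
    return list(samples)


def predict(model, features, k=3):
    """Label a feature vector by majority vote among its k nearest samples."""
    by_distance = sorted(model, key=lambda item: math.dist(item[0], features))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]
```

For example, `predict(train(TRAINING), (25, 7, 3, 2))` returns `"distribution"`, because the three closest training vectors all carry that label.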

FIG. 3 is a block diagram showing the functional configuration of the identification device 2 according to the present embodiment.
The identification device 2 is an information processing device (computer) that includes a control unit 20 and a storage unit 21 as well as input/output and communication interfaces. Each function of the present embodiment is realized by the control unit 20 reading and executing software (an identification program) stored in the storage unit 21.

The control unit 20 of the identification device 2 includes an acquisition unit 201, a calculation unit 202, a date extraction unit 203, a tag count unit 204, an extension count unit 205, and an identification unit 206.

The acquisition unit 201 acquires Whois, which is the registration information of the specified domain to be identified.
The calculation unit 202, the date extraction unit 203, the tag count unit 204, and the extension count unit 205 are functional units equivalent to the calculation unit 102, the date extraction unit 103, the tag count unit 104, and the extension count unit 105 of the model generation device 1, respectively. These functional units derive the first to fourth feature quantities, which are input to the identification unit 206.

The identification unit 206 is a classifier that implements the identification model generated by the model generation device 1 and determines the type of a domain.
Based on the input first to fourth feature quantities, the identification unit 206 identifies landing domains, hop-point domains, and distribution domains in a DBD attack.

FIG. 4 is a diagram showing the feature quantities that are input to the identification model according to the present embodiment.
The feature quantities that are input to machine learning when the identification model is generated, or input to the generated classifier, are obtained from the Whois information on the domain and from the page information (HTML documents).

From Whois, words obtained by text analysis of the entire document and their TF-IDF values are obtained as the first feature quantity.
In addition, the registration date and the update date are obtained from the Whois attribute information as the second feature quantity.
From the page information, the number of occurrences of specific tags is obtained as the third feature quantity, and the number of occurrences of specific extensions is obtained as the fourth feature quantity.
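
The four feature quantities could be gathered into a single input vector along the lines of the sketch below. The `DomainFeatures` class, its field names, and the vector layout are hypothetical; the description does not prescribe how the features are encoded for the classifier.

```python
from dataclasses import dataclass, field


@dataclass
class DomainFeatures:
    """Hypothetical container for the four feature quantities of one domain."""
    tf_idf: dict = field(default_factory=dict)  # first: word -> TF-IDF value
    days_since_registration: int = 0            # second (from Whois dates)
    days_since_update: int = 0                  # second (from Whois dates)
    tag_count: int = 0                          # third (from page documents)
    extension_count: int = 0                    # fourth (from page documents)

    def to_vector(self, vocabulary):
        """Flatten to a list: one TF-IDF slot per vocabulary word, then counts."""
        word_part = [self.tf_idf.get(word, 0.0) for word in vocabulary]
        return word_part + [
            self.days_since_registration,
            self.days_since_update,
            self.tag_count,
            self.extension_count,
        ]
```

Using the same vocabulary order at training time and at identification time keeps each vector position tied to the same word.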

According to the present embodiment, the model generation device 1 performs text analysis on the Whois record of each domain, extracts an index related to the appearance frequency of words as a feature quantity, and generates by learning a model for identifying the types of domains related to a DBD attack, including landing domains and distribution domains.
As a result, compared with the costly and inaccurate process of extracting attribute information from Whois records whose format and terminology are not standardized, the identification device 2 can extract feature quantities more easily and can efficiently identify the type of a domain related to a DBD attack.

Consequently, a network administrator or the like can efficiently implement access control appropriate to the domain type for users' browsing, for example permitting access to a landing domain while restricting redirects, or prohibiting access to a distribution domain.
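
A simple lookup table is one way such type-dependent access control might be expressed; the sketch below is an illustration under that assumption. The label strings and action names are hypothetical, and only the two policies mentioned in the text are encoded, since the description does not state a policy for other domain types.

```python
# Hypothetical action names; only the two policies named in the text appear.
ACCESS_POLICY = {
    "landing": "allow_access_restrict_redirect",
    "distribution": "deny_access",
}


def decide(label):
    """Return the action for an identified domain type.

    Returns None for labels without a stated policy, leaving that
    decision to the network administrator.
    """
    return ACCESS_POLICY.get(label)
```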

The identification model also uses the second feature quantity based on the Whois registration date and update date, the third feature quantity consisting of the number of occurrences of specific tags described in the page documents of the domain, and the fourth feature quantity consisting of the number of occurrences of specific extensions described in the page documents.
This improves the accuracy of domain type identification, and it can be expected that hop-point domains, in addition to landing domains and distribution domains, can be identified accurately.

Although an embodiment of the present invention has been described above, the present invention is not limited to that embodiment. The effects described in the present embodiment merely list the most favorable effects arising from the present invention, and the effects of the present invention are not limited to those described in the present embodiment.

The model generation method performed by the model generation device 1 and the identification method performed by the identification device 2 are realized by software. When they are realized by software, the programs constituting the software are installed on an information processing device (computer). These programs may be recorded on removable media such as a CD-ROM and distributed to users, or may be distributed by being downloaded to a user's computer over a network. These programs may also be provided to a user's computer as a Web service over a network without being downloaded.

DESCRIPTION OF SYMBOLS
1 Model generation device
2 Identification device
10 Control unit
11 Storage unit
20 Control unit
21 Storage unit
101 Acquisition unit
102 Calculation unit
103 Date extraction unit
104 Tag count unit
105 Extension count unit
106 Learning unit
201 Acquisition unit
202 Calculation unit
203 Date extraction unit
204 Tag count unit
205 Extension count unit
206 Identification unit

Claims

1. A model generation device comprising:
an acquisition unit that acquires, for a plurality of domains assigned labels including a landing domain and a distribution domain in a DBD (drive-by download) attack, registration information of those domains;
a calculation unit that extracts words contained in the registration information and calculates an index related to the appearance frequency of each word; and
a learning unit that generates an identification model by supervised learning based on the labels, using the words and the index for each word as a first feature quantity.

2. The model generation device according to claim 1, further comprising a date extraction unit that extracts a registration date and an update date from the registration information,
wherein the learning unit generates the identification model using the number of days elapsed since the registration date and the number of days elapsed since the update date as a second feature quantity.

3. The model generation device according to claim 1, further comprising a tag count unit that counts a first number of times that tags of specific types appear in a page document of the domain,
wherein the learning unit generates the identification model using the first number of times as a third feature quantity.

4. The model generation device according to claim 1, further comprising an extension count unit that counts a second number of times that file extensions of specific types appear in a page document of the domain,
wherein the learning unit generates the identification model using the second number of times as a fourth feature quantity.

5. The model generation device according to claim 1, wherein the labels further include a hop-point domain.

6. A model generation method in which a computer executes:
an acquisition step of acquiring, for a plurality of domains assigned labels including a landing domain and a distribution domain in a DBD (drive-by download) attack, registration information of those domains;
a calculation step of extracting words contained in the registration information and calculating an index related to the appearance frequency of each word; and
a learning step of generating an identification model by supervised learning based on the labels, using the words and the index for each word as a first feature quantity.

7. A model generation program for causing a computer to execute:
an acquisition step of acquiring, for a plurality of domains assigned labels including a landing domain and a distribution domain in a DBD (drive-by download) attack, registration information of those domains;
a calculation step of extracting words contained in the registration information and calculating an index related to the appearance frequency of each word; and
a learning step of generating an identification model by supervised learning based on the labels, using the words and the index for each word as a first feature quantity.

8. An identification device comprising:
an acquisition unit that acquires registration information of a specified domain;
a calculation unit that extracts words contained in the registration information and calculates an index related to the appearance frequency of each word; and
an identification unit that identifies a landing domain and a distribution domain in a DBD (drive-by download) attack, using the words and the index for each word as a first feature quantity.

9. An identification method in which a computer executes:
an acquisition step of acquiring registration information of a specified domain;
a calculation step of extracting words contained in the registration information and calculating an index related to the appearance frequency of each word; and
an identification step of identifying a landing domain and a distribution domain in a DBD (drive-by download) attack, using the words and the index for each word as a first feature quantity.

10. An identification program for causing a computer to execute:
an acquisition step of acquiring registration information of a specified domain;
a calculation step of extracting words contained in the registration information and calculating an index related to the appearance frequency of each word; and
an identification step of identifying a landing domain and a distribution domain in a DBD (drive-by download) attack, using the words and the index for each word as a first feature quantity.