JP2020052766A

JP2020052766A - Determination device and determination method

Info

Publication number: JP2020052766A
Application number: JP2018181907A
Authority: JP
Inventors: Yukiko Sawatani; Akira Yamada; Ayumi Kubota
Original assignee: KDDI Corp
Current assignee: KDDI Corp
Priority date: 2018-09-27
Filing date: 2018-09-27
Publication date: 2020-04-02
Anticipated expiration: 2038-09-27
Also published as: JP7175148B2

Abstract

To provide a determination device and a determination method that can accurately determine, from the URL of a site, whether the site is harmful. SOLUTION: A determination device 100 includes: a vectorization unit 40 that vectorizes the character string of an arbitrary URL using a URL vectorization model VM, which treats the character string contained in a URL as one sentence and vectorizes it; and a determination unit 60 that determines whether the site indicated by a URL to be determined is harmful, using the vectors obtained by the vectorization unit 40 from the character strings of a plurality of URLs previously determined to be harmful or not, and the vector obtained by the vectorization unit 40 from the character string of the URL to be determined. SELECTED DRAWING: Figure 1

Description

The present invention relates to a determination device and a determination method that use the character string of a URL to determine whether the site at that URL is harmful.

As the types of services provided over networks have increased, so has the number of sites providing those services. As a result, the number of harmful sites has also grown, for example phishing sites that impersonate sites providing legitimate services and illegally acquire user information such as credit card numbers. However, it is difficult for a user to judge whether a site provides a legitimate service by looking at the URL (Uniform Resource Locator) of the site.

For example, Google (registered trademark) builds a database of unsafe sites based on daily investigation and provides an API (Application Programming Interface) through which applications such as web browsers can use the database (see Non-Patent Document 1). This allows a user to determine, using an application such as a web browser, whether a site is legitimate.

In addition, a filtering method capable of handling malicious URLs has been proposed that determines the granularity of the partial URLs used for filtering based on the common part of the tree structure of a group of malicious URLs and the state of each URL (see Non-Patent Document 2). A method of detecting phishing attacks by turning similar URL structures into rules and calculating similarity has also been proposed (see Non-Patent Document 3).

Non-Patent Document 1: Google Safe Browsing (https://developers.google.com/safe-browsing/v4/)
Non-Patent Document 2: Mitsuaki Akiyama, Takeshi Yagi, Mitsuyasu Ito, "Granularity Determination of URL Filtering Focusing on Tree Structure of Malicious URLs", IEICE Technical Report, No. 110, 266, pp. 53-58, 2010
Non-Patent Document 3: P. Prakash et al., "PhishNet: Predictive Blacklisting to Detect Phishing Attacks", IEEE INFOCOM, pp. 346-350, 2010

However, harmful sites are increasing daily, and the character strings of their URLs keep changing. It is therefore difficult to fix a single set of rules for URL character strings in filtering. It is also difficult for a database to cover every harmful site.

An object of the present invention is to provide a determination device and a determination method that can accurately determine, from the URL of a site, whether the site is harmful.

(1) A determination device according to the present invention includes: a vectorization unit that vectorizes the character string of an arbitrary URL using a URL vectorization model, which treats the character string contained in a URL as one sentence and vectorizes it; and a determination unit that determines whether the site indicated by a URL to be determined is harmful, using the vectors into which the vectorization unit has vectorized the character strings of a plurality of URLs previously determined to be harmful or not, and the vector into which the vectorization unit has vectorized the character string of the URL to be determined.

(2) The determination device according to (1) may further include a learning unit that generates, by machine learning, a determination model for determining whether the site indicated by an arbitrary URL is harmful, using as teacher data the labels indicating the determination result as to whether each of the plurality of URLs is harmful and the vectors into which the vectorization unit has vectorized the character strings of the plurality of URLs. The determination unit may then use the determination model generated by the learning unit to determine whether the site indicated by the URL to be determined is harmful.

(3) The determination device according to (1) may further include an association unit that generates association data by associating the labels indicating the determination result as to whether each of the plurality of URLs is harmful with the vectors into which the character strings of the plurality of URLs have been vectorized. The determination unit may then use the generated association data to determine whether the site indicated by the URL to be determined is harmful.

(4) In the determination device according to (1), the URL vectorization model may be a vectorization model that generates, from the character string of a URL, a vector of a short character string for each structure, so that each short character string obtained by dividing the URL character string by structure, into at least a query part, a path part, and a host name, is vectorized as one sentence. The vectorization unit may use the URL vectorization model to vectorize the character string of an arbitrary URL for each structure, and the determination unit may determine whether the site indicated by the URL to be determined is harmful, using the vectors generated by the vectorization unit for each structure from the character strings of a plurality of URLs previously determined to be harmful or not, and the vectors generated by the vectorization unit for each structure from the character string of the URL to be determined.

(5) In the determination device according to (4), the determination unit may select the structure of the URL according to the type of harmful site.

(6) In the determination device according to (4), the URL vectorization model may further be a vectorization model that generates a concatenated vector by concatenating the vectors generated for each structure from the character string of a URL. The vectorization unit may further use the URL vectorization model to generate a concatenated vector by concatenating the vectors generated for each structure from the character string of an arbitrary URL, and the determination unit may further determine whether the site indicated by the URL to be determined is harmful, using the concatenated vectors generated by the vectorization unit from the character strings of a plurality of URLs previously determined to be harmful or not, and the concatenated vector generated by the vectorization unit from the character string of the URL to be determined.

(7) A determination method according to the present invention is a determination method implemented by a computer, and includes: a vectorization step of vectorizing an arbitrary character string using a URL vectorization model, which treats the character string contained in a URL as one sentence and vectorizes it; and a determination step of determining whether the site indicated by a URL to be determined is harmful, using the vectors into which the character strings of a plurality of URLs previously determined to be harmful or not have been vectorized in the vectorization step, and the vector into which the character string of the URL to be determined has been vectorized.

According to the present invention, it is possible to accurately determine, from the URL of a site, whether the site is harmful.

The drawings are, in order:
A diagram illustrating an example of the determination device according to the first embodiment.
A diagram illustrating the generation process in the determination device according to the first embodiment.
A diagram illustrating the determination process in the determination device according to the first embodiment.
A diagram illustrating an example of the determination device according to the second embodiment.
A diagram illustrating the generation process in the determination device according to the second embodiment.
A diagram illustrating the determination process in the determination device according to the second embodiment.
A diagram illustrating an example of the determination device according to the third embodiment.
A diagram illustrating the generation process in the determination device according to the third embodiment.
A diagram illustrating the determination process in the determination device according to the third embodiment.

Embodiments of the present invention are described below.

[First Embodiment]
FIG. 1 is a diagram illustrating an example of the determination device according to the first embodiment.

The determination device 100 according to the first embodiment is, for example, an information processing device (computer) such as a personal computer or a server having a control unit 10 such as a processor and a storage unit 20 such as a hard disk device or a memory. The determination device 100 also has interface functions for external devices, such as input/output devices and a communication interface. The determination device 100 is thereby connected, by wire or wirelessly, to an external storage device 200, storage device 300, and storage device 400. The determination device 100 may also be connected to the storage device 200, the storage device 300, and the storage device 400 via a network.

The storage device 200 is, for example, a data server including a hard disk device, and stores URL data 210. The URL data 210 contains the URL character strings of a plurality of sites, and may also contain access logs, lists of known harmful sites, lists of legitimate sites, and the like.

The storage device 300 is, for example, a data server including a hard disk device, and stores, for example, harmful URL data 310 to which a degree of malignancy has been assigned. The storage device 300 may contain, for example, URL data from Google Safe Browsing (registered trademark). The data is not limited to Google Safe Browsing URL data and may be malignancy-labeled URL data provided by any site.
The storage device 400 is, for example, a data server including a hard disk device, and stores, for example, legitimate URL data 410 indicating legitimate URLs provided by any site.

In the determination device 100, the control unit 10 executes a determination processing program stored in the storage unit 20, thereby providing the functions of a model generation unit 30, a vectorization unit 40, a learning unit 50, and a determination unit 60.

The model generation unit 30, for example, extracts the necessary information from the set of URL data 210 stored in the storage device 200 and divides each URL into short character strings. Here, the necessary information is the character string of each URL. N-grams (N ≥ 2) may be applied as the method of dividing a URL character string into short character strings, but the method is not limited to this; any dividing method may be applied.
The model generation unit 30 generates a URL vectorization model VM (hereinafter also referred to as the "vectorization model VM") for treating a URL that has been divided into such short character strings as one sentence and vectorizing it. Here, for example, the URL vectorization model VM may be generated using Doc2vec (registered trademark), which is known to those skilled in the art (Q. Le and T. Mikolov, "Distributed Representations of Sentences and Documents", Proceedings of the 31st International Conference on Machine Learning, pp. II-1188-II-1196, 2014). In this way, a URL vectorization model is generated that treats the character string of an arbitrary URL as one sentence and vectorizes it.
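The division step described above can be illustrated with a minimal sketch. It assumes character-level, overlapping N-grams; the text only says that N-grams with N ≥ 2 may be applied and does not fix the unit or the overlap, so the function name `url_to_ngrams` and its defaults are hypothetical.

```python
def url_to_ngrams(url: str, n: int = 3) -> list[str]:
    """Split a URL string into overlapping character n-grams (assumed form)."""
    if n < 2:
        raise ValueError("n must be at least 2")
    if len(url) < n:
        # A URL shorter than n is kept as a single token (an assumption).
        return [url] if url else []
    return [url[i:i + n] for i in range(len(url) - n + 1)]


# The resulting token list plays the role of the "sentence" for one URL.
tokens = url_to_ngrams("example.com/a?b=1", n=3)
print(tokens[:4])
```

A token list of this kind is what a sentence-embedding tool would receive as the words of one document.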

Doc2vec generates the URL vectorization model VM based on machine learning such as deep learning. Because the URL vectorization model VM generated by Doc2vec vectorizes a URL as a sentence, it can compute the degree of similarity between URLs, which makes it possible to search for and group similar URLs. As described above, the model generation unit 30 generates, for example through the machine learning in Doc2vec, the URL vectorization model VM for treating the character string of an arbitrary URL as one sentence and vectorizing it. The model generation unit 30 stores the generated URL vectorization model VM in the storage unit 20. Although the model generation unit 30 uses Doc2vec here to generate the URL vectorization model VM, this is not a limitation; other software capable of generating a sentence vectorization model may be used.
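The similarity computation mentioned above can be sketched as follows. This is not Doc2vec: as a stand-in, the sketch builds a fixed-size count vector by hashing character 3-grams, purely so that the example runs with the standard library. The names `embed`, `cosine`, and the dimension `DIM` are hypothetical, and cosine similarity is assumed as the similarity measure.

```python
import math
import zlib

DIM = 64  # hypothetical vector size


def embed(url: str, n: int = 3, dim: int = DIM) -> list[float]:
    """Stand-in vectorizer: hashed character n-gram counts (not Doc2vec)."""
    vec = [0.0] * dim
    for i in range(max(len(url) - n + 1, 0)):
        gram = url[i:i + n]
        # crc32 is used because it is stable across Python processes.
        vec[zlib.crc32(gram.encode("utf-8")) % dim] += 1.0
    return vec


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


similarity = cosine(embed("http://a.example/login.php"),
                    embed("http://b.example/login.php"))
print(round(similarity, 3))
```

With a trained sentence-embedding model in place of `embed`, the same `cosine` function could rank known URLs by closeness to a query URL.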

Although the model generation unit 30 is described as acquiring the URL data 210 from one storage device 200, it may acquire various URL data 210 from a plurality of storage devices 200, and may also acquire the harmful URL data 310 to which a degree of malignancy has been assigned and the legitimate URL data 410 of legitimate URLs. Alternatively, in addition to acquiring the URL data 210 from the storage device 200, the model generation unit 30 may itself search various sites and acquire URL data.
Hereinafter, for simplicity, the URL vectorization model VM is also referred to as the "vectorization model VM" unless otherwise specified.

The vectorization unit 40 reads the vectorization model VM from the storage unit 20 and uses it to vectorize the character string of an arbitrary URL, for example URL data previously determined to be harmful or not, or a URL to be determined.
More specifically, the vectorization unit 40 receives, for example, the harmful URL data 310 with assigned malignancy from the storage device 300, generates teacher data consisting of the URL vector obtained by vectorizing the character string data of each URL and a label indicating the malignancy of that URL, and stores the generated teacher data in the storage unit 20.
The vectorization unit 40 also receives, for example, the legitimate URL data 410 from the storage device 400, generates teacher data consisting of the URL vector obtained by vectorizing the character string data of each URL and a label indicating that the URL is legitimate, and stores the generated teacher data in the storage unit 20. The vectorization unit 40 may also treat each URL in the URL data 210 acquired from the storage device 200 that is not included in the harmful URL data 310 as a legitimate URL, generate teacher data consisting of its URL vector and a label indicating that the URL is legitimate, and store the generated teacher data in the storage unit 20.
The vectorization unit 40 also vectorizes the character string of the URL to be determined.
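The construction of teacher data described above might look like the following sketch. The function `build_teacher_data`, the label strings, and the toy vectorizer are all hypothetical; the vectorizer is passed in as a callable so that any vectorization model could be substituted. The third step implements the optional rule that URLs absent from the harmful list are treated as legitimate.

```python
def build_teacher_data(harmful, legitimate, all_urls, vectorize):
    """Return a list of (URL vector, label) pairs.

    harmful:    dict mapping URL -> malignancy label (e.g. "phishing")
    legitimate: iterable of URLs known to be legitimate
    all_urls:   iterable of other collected URLs (e.g. from access logs)
    vectorize:  callable turning a URL string into a vector
    """
    data = [(vectorize(url), label) for url, label in harmful.items()]
    data += [(vectorize(url), "legitimate") for url in legitimate]
    # Optional rule from the text: collected URLs that are not listed as
    # harmful are treated as legitimate.
    known = set(harmful) | set(legitimate)
    data += [(vectorize(url), "legitimate")
             for url in all_urls if url not in known]
    return data


def toy_vectorize(url):
    """Placeholder vectorizer used only to make the sketch runnable."""
    return [float(len(url))]


teacher_data = build_teacher_data(
    harmful={"http://bad.example/x": "phishing"},
    legitimate=["http://good.example/"],
    all_urls=["http://bad.example/x", "http://other.example/"],
    vectorize=toy_vectorize,
)
print(teacher_data)
```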

The learning unit 50 takes the teacher data stored in the storage unit 20 as input and executes, for example, supervised machine learning. The learning unit 50 thereby generates a determination model ML for determining whether the site of a URL to be determined is harmful, and stores the generated determination model ML in the storage unit 20. Depending on the content of the labels indicating the malignancy of the URLs, the determination model may be a binary model that determines whether the site of the URL to be determined is harmful, or a multi-class model that determines the type of malignancy of that site, such as malware distribution site, adult site, phishing site, or fraud site.

The determination unit 60 inputs the vector of the URL to be determined, as vectorized by the vectorization unit 40, into the determination model ML and determines whether the site indicated by that URL is harmful. The determination unit 60 displays the determination result on, for example, a display such as an LCD (Liquid Crystal Display) included in the determination device 100. The learning unit 50 may also add the vector of the URL to be determined and its determination result to the teacher data, execute machine learning again, and update the determination model ML. The determination device 100 can thereby improve its accuracy in determining harmful URLs.
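The learning and determination steps can be sketched with a very small supervised model. The text does not name a learning algorithm, so a nearest-centroid classifier is used here purely as a stand-in. The function names, the label strings, and the two-dimensional toy vectors are hypothetical; in practice the vectors would come from the vectorization unit.

```python
import math


def train_centroids(samples):
    """Learn one mean vector per label from (vector, label) pairs."""
    sums, counts = {}, {}
    for vec, label in samples:
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, x in enumerate(vec):
            acc[i] += x
        counts[label] = counts.get(label, 0) + 1
    return {label: [x / counts[label] for x in acc]
            for label, acc in sums.items()}


def determine(model, vec):
    """Return the label whose centroid is closest to vec (Euclidean)."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(model, key=lambda label: dist(model[label], vec))


# Toy teacher data standing in for labelled URL vectors.
samples = [
    ([0.0, 0.0], "legitimate"),
    ([0.0, 2.0], "legitimate"),
    ([10.0, 10.0], "harmful"),
    ([12.0, 10.0], "harmful"),
]
model = train_centroids(samples)
print(determine(model, [11.0, 9.0]))
```

Using more than two label values in `samples` turns the same sketch into a multi-class model, and retraining with newly determined vectors appended to `samples` corresponds to the model update described above.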

FIG. 2A is a diagram illustrating the generation process in the determination device 100 according to the first embodiment. The process shown in FIG. 2A is executed, for example, when an administrator or other operator of the determination device 100 operates an input device, such as a keyboard or mouse, included in the determination device 100.

In step S1, the model generation unit 30 extracts the necessary information from the set of URL data 210 stored in the storage device 200, divides each URL into short character strings, treats each URL divided in this way as one sentence, and generates the URL vectorization model VM (vectorization model VM). The model generation unit 30 stores the generated URL vectorization model VM in the storage unit 20.

In step S2, the vectorization unit 40 receives, for example, the harmful URL data 310 with assigned malignancy from the storage device 300, generates teacher data consisting of the URL vector obtained by vectorizing the character string data of each URL and a label indicating the malignancy of that URL, and stores the generated teacher data in the storage unit 20. The vectorization unit 40 also receives the legitimate URL data 410 from the storage device 400, generates teacher data consisting of the URL vector obtained by vectorizing the character string data of each URL and a label indicating that the URL is legitimate, and stores the generated teacher data in the storage unit 20.

In step S3, the learning unit 50 takes the teacher data stored in the storage unit 20 as input and executes supervised machine learning. The learning unit 50 thereby generates the determination model ML for determining whether the site of a URL to be determined is harmful, and stores the generated determination model ML in the storage unit 20.
The generation of the URL vectorization model VM in step S1 may be executed separately from the generation of the teacher data in step S2 and of the determination model ML in step S3.

FIG. 2B is a diagram illustrating the determination process in the determination device 100 according to the first embodiment. The process shown in FIG. 2B is executed, for example, when an administrator or other operator of the determination device 100 operates the input device of the determination device 100.
In step S4, the determination unit 60 acquires the URL to be determined from, for example, the URL data 210 of the storage device 200.
In step S5, the vectorization unit 40 reads the URL vectorization model VM from the storage unit 20 and uses it to vectorize the character string of the URL to be determined acquired in step S4.

In step S6, the determination unit 60 inputs the vector of the URL to be determined, as vectorized by the vectorization unit 40, into the determination model ML and determines whether the site of that URL is harmful. The determination unit 60 displays the determination result on the display of the determination device 100.

As described above, in the first embodiment, the determination device 100 executes machine learning using the URL vectors obtained by treating the URL character string data of the harmful URL data 310 and the legitimate URL data 410 each as one sentence and vectorizing it, together with the label assigned to each URL vector, and generates the determination model ML. The determination device 100 then uses the vectorization model VM to treat the character string data of the URL to be determined as one sentence and vectorize it, and inputs the resulting URL vector into the determination model ML to determine whether the site of the URL to be determined is harmful.

That is, the determination device 100 automatically learns the structure of URLs that are likely to be harmful by executing machine learning with the URL vectors of the URL data 210, the harmful URL data 310, and the legitimate URL data 410, and the label assigned to each URL vector, as teacher data. The determination device 100 can thereby accurately determine whether the site of the URL to be determined is harmful without using any filtering rule (such as a signature). In addition, by continually repeating learning and verification, the determination device 100 can automatically follow attackers' URLs, which change daily, and can thus respond more quickly.

Furthermore, because the determination device 100 executes the determination process internally after acquiring the URL data 210 of the storage device 200, the harmful URL data 310 of the storage device 300, and the legitimate URL data 410 of the storage device 400, it avoids the leakage of access data that could occur when querying an external service. Because the determination process is executed within the determination device 100, the cost of external communication, in both time and money, is also reduced.

[Second Embodiment]
Next, a second embodiment will be described. In the second embodiment, the character strings extracted from a URL based on its contextual features (for example, the query structure, the path structure, and the host-name structure) are each treated as one sentence and vectorized. That is, vectorization is performed separately for each character string extracted per contextual feature of the URL. To distinguish them from the URL vectors of the first embodiment, the vectors generated per contextual feature are called structure-specific URL vectors.
The second embodiment is thus characterized in that, based on the structure-specific URL vectors, a vectorization model is created for each structure, machine learning is performed for each structure, and a determination model is generated for each structure.
For example, the URLs of harmful sites such as malware distribution sites are considered to be characterized by their query structure, and the URLs of harmful sites such as adult sites by their path structure. By creating structure-specific determination models from the structure-specific URL vectors, it is possible, for example, to vectorize the character string of the query structure of the URL to be determined into a structure-specific URL vector and to determine, based on the structure-specific determination model, whether that URL belongs to a malware distribution site.
FIG. 3 is a diagram illustrating an example of the determination device according to the second embodiment. In FIG. 3, elements having the same functions as elements of the determination device 100 according to the first embodiment are denoted by the same reference numerals, and their detailed description is omitted.

In the determination device 100A according to the second embodiment, the control unit 10 executes a determination processing program stored in the storage unit 20, thereby providing the functions of a model generation unit 30a, a vectorization unit 40a, a learning unit 50a, and a determination unit 60a.

The model generation unit 30a generates a URL vectorization model (also referred to as a "structure-specific URL vectorization model") for each contextual feature of the URL (for example, query structure, path structure, or host name structure).
More specifically, the model generation unit 30a extracts character strings from the set of URL data 210 stored in the storage device 200 based on the contextual features of the URL (for example, query structure, path structure, or host name structure). Each extracted string is referred to as a "structure-specific character string of the URL" and is divided into short character strings by applying, for example, N-gram (N ≥ 2), as in the first embodiment.
The model generation unit 30a generates a structure-specific URL vectorization model VM for regarding the structure-specific character string of a URL, divided in this way, as one sentence and vectorizing it. Here, as in the first embodiment, the structure-specific vectorization model may be generated using, for example, Doc2vec, which is known to those skilled in the art. In this way, a structure-specific URL vectorization model is generated that regards the structure-specific character string of an arbitrary URL as one sentence and vectorizes it. Where K is the number of types of structures representing contextual features of the URL, structure-specific URL vectorization models VM(1) to VM(K) are generated. The model generation unit 30a then stores the generated structure-specific URL vectorization models VM(1) to VM(K) in the storage unit 20.
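By way of a non-limiting illustration, the extraction of structure-specific character strings and their division into N-grams might be sketched as follows. The function names `structure_strings` and `char_ngrams`, the choice of three structures, and the use of overlapping character N-grams are assumptions made for this sketch; the description does not fix them. Training the vectorization model itself (for example, with Doc2vec) is not shown.

```python
from urllib.parse import urlsplit


def structure_strings(url):
    """Split a URL into the character strings of three example structures."""
    parts = urlsplit(url)
    return {
        "host": parts.hostname or "",
        "path": parts.path,
        "query": parts.query,
    }


def char_ngrams(text, n=2):
    """Divide a string into overlapping character N-grams (N >= 2)."""
    if n < 2:
        raise ValueError("N must be at least 2")
    return [text[i:i + n] for i in range(len(text) - n + 1)]


url = "http://example.com/a/b.php?id=1&x=2"
parts = structure_strings(url)
# The N-gram list plays the role of the "one sentence" for the query structure.
query_sentence = char_ngrams(parts["query"], n=2)
```

Each list produced by `char_ngrams` would then be supplied, as one sentence, to whatever vectorization model is trained for that structure.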

Based on the structure-specific URL vectorization model VM(i) (1 ≤ i ≤ K), the vectorization unit 40a takes the character string of an arbitrary URL, such as URL data already determined to be a malware distribution site, an adult site, or the like, or a URL to be determined, extracts the structure-specific character string from it based on the contextual features of the URL (for example, query structure, path structure, or host name structure), and regards that string as one sentence and vectorizes it.
The case of the query structure is described below as an example.
The vectorization unit 40a selects a structure-specific URL vectorization model VM(i) based on the specified structure. Using that model, it inputs from the storage device 300 the harmful URL data 310 determined to be, for example, malware distribution sites, and generates teacher data consisting of a structure-specific URL vector, obtained by vectorizing the character string contained in the query structure extracted from the character string data of each URL, and a label indicating that the URL is a malware distribution site. The generated teacher data is stored in the storage unit 20.
Similarly, using the structure-specific URL vectorization model VM(i), the vectorization unit 40a inputs the regular URL data 410 from the storage device 400 and generates teacher data consisting of a structure-specific URL vector, obtained by vectorizing the character string contained in the query structure extracted from the character string data of each URL, and a label indicating that the URL is a regular URL. The generated teacher data is stored in the storage unit 20.
Further, the vectorization unit 40a uses the structure-specific URL vectorization model VM(i) to vectorize the character string contained in the query structure extracted from the URL to be determined.
In this way, the vectorization unit 40a selects the structure-specific URL vectorization model VM(i) based on the specified structure and uses it to regard the structure-specific character string of a URL as one sentence and vectorize it.
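As a rough sketch of how teacher data for the query structure could be assembled, the following example pairs each vector with its label. Note that the vectorizer here is a stand-in: instead of a trained model such as Doc2vec, it hashes character bigrams of the query string into a fixed-length count vector. The names `query_vector` and `make_teacher_data`, the dimension of 16, the label strings, and the example URLs are all hypothetical.

```python
import zlib
from urllib.parse import urlsplit

DIM = 16  # hypothetical vector size chosen for this sketch


def query_vector(url, n=2, dim=DIM):
    """Stand-in for VM(i): hash character N-grams of the query into a count vector."""
    query = urlsplit(url).query
    vec = [0.0] * dim
    for i in range(len(query) - n + 1):
        gram = query[i:i + n]
        # crc32 is used because it is deterministic across runs, unlike hash().
        vec[zlib.crc32(gram.encode("utf-8")) % dim] += 1.0
    return vec


def make_teacher_data(harmful_urls, regular_urls):
    """Pair each structure-specific URL vector with its label."""
    data = [(query_vector(u), "malware") for u in harmful_urls]
    data += [(query_vector(u), "regular") for u in regular_urls]
    return data


teacher_data = make_teacher_data(
    ["http://bad.example/x?cmd=dl&f=a.exe"],  # stands in for harmful URL data 310
    ["http://ok.example/?q=cat"],             # stands in for regular URL data 410
)
```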

The learning unit 50a executes, for example, supervised machine learning by inputting the teacher data generated for each URL structure (structure (i): 1 ≤ i ≤ K) and stored in the storage unit 20. In this way, the learning unit 50a generates a structure-specific determination model ML(i) for determining, from the structure-specific character string of the URL for structure (i), whether the site of the URL to be determined is harmful. The learning unit 50a stores the generated structure-specific determination model ML(i) in the storage unit 20. Here, the determination model may be a structure-specific determination model ML(i) that determines the type of maliciousness of the site of the URL to be determined, such as a malware distribution site, an adult site, a phishing site, or a fraud site.
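The description does not specify which supervised learning algorithm the learning unit 50a uses. Purely for illustration, the sketch below trains a nearest-centroid classifier on (vector, label) teacher data; this algorithm, the function names `train_centroids` and `predict`, and the toy vectors are assumptions, and any other supervised learner could take its place.

```python
import math


def train_centroids(teacher_data):
    """Compute one mean vector per label from (vector, label) pairs."""
    sums, counts = {}, {}
    for vec, label in teacher_data:
        if label not in sums:
            sums[label] = [0.0] * len(vec)
            counts[label] = 0
        sums[label] = [s + v for s, v in zip(sums[label], vec)]
        counts[label] += 1
    return {label: [s / counts[label] for s in sums[label]] for label in sums}


def predict(model, vec):
    """Return the label whose centroid is closest in Euclidean distance."""
    return min(model, key=lambda label: math.dist(model[label], vec))


toy_teacher_data = [
    ([1.0, 0.0], "harmful"),
    ([0.8, 0.2], "harmful"),
    ([0.0, 1.0], "regular"),
]
ml_i = train_centroids(toy_teacher_data)  # plays the role of ML(i)
```

Because the labels are arbitrary strings, the same sketch also covers the case in which the model distinguishes several types of harmful site rather than only harmful and regular.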

The determination unit 60a decides which type of harmful site to detect, such as a malware distribution site or an adult site, based on, for example, an instruction from an administrator or the like input via the input device of the determination device 100A. The determination unit 60a selects the structure-specific vectorization model VM(i) and the structure-specific determination model ML(i) of the structure (i) corresponding to the decided site type (1 ≤ i ≤ K), and reads them from the storage unit 20. The vectorization unit 40a vectorizes the character string of the structure (i) of the URL to be determined using the structure-specific vectorization model VM(i). The determination unit 60a inputs the resulting vector of the structure (i) into the structure-specific determination model ML(i) and determines whether the site of the URL to be determined is harmful. That is, the determination unit 60a can determine whether the site of the URL to be determined is harmful according to the type of harmful site. The determination unit 60a then displays the determination result on, for example, the display of the determination device 100A.

The learning unit 50a may update the structure-specific determination model ML(i) by executing machine learning with the vector of the structure (i) of the determined URL and its determination result added to the teacher data. This allows the determination device 100A to improve its accuracy in determining harmful URLs.

FIG. 4A is a diagram illustrating the generation process in the determination device 100A according to the second embodiment. The process shown in FIG. 4A is executed, for example, when an administrator or the like of the determination device 100A operates the input device of the determination device 100A.

In step S11, the model generation unit 30a extracts the necessary information (the structure-specific character strings of URLs) from the set of URL data 210 stored in the storage device 200, divides each structure-specific character string into short character strings, regards the divided structure-specific character string of each URL as one sentence, and generates the structure-specific URL vectorization model VM(i). The model generation unit 30a stores the generated structure-specific URL vectorization model VM(i) in the storage unit 20.

In step S12, the vectorization unit 40a generates teacher data for each URL structure. For example, it inputs from the storage device 300 the harmful URL data 310 to which a degree of maliciousness has been assigned, generates teacher data consisting of a URL vector obtained by vectorizing the structure-specific character string of each URL and a label indicating the degree of maliciousness of that URL, and stores the generated teacher data in the storage unit 20. The vectorization unit 40a also inputs the regular URL data 410 from the storage device 400, generates teacher data consisting of a URL vector obtained by vectorizing the structure-specific character string of each URL and a label indicating that the URL is a regular URL, and stores the generated teacher data in the storage unit 20.

In step S13, the learning unit 50a inputs the teacher data generated for each URL structure and stored in the storage unit 20, and executes supervised machine learning. In this way, the learning unit 50a generates the structure-specific determination model ML(i) for determining whether the site of the URL to be determined is harmful. The learning unit 50a stores the structure-specific determination model ML(i) generated for each URL structure in the storage unit 20.
Note that the generation of the structure-specific URL vectorization model VM in step S11 may be executed separately from the generation of the teacher data in step S12 and of the structure-specific determination model ML in step S13.

FIG. 4B is a diagram illustrating the determination process in the determination device 100A according to the second embodiment. The process shown in FIG. 4B is executed, for example, when an administrator or the like of the determination device 100A operates the input device of the determination device 100A.
In step S14, the determination unit 60a acquires the URL to be determined from, for example, the URL data 210 in the storage device 200.
In step S15, the determination unit 60a decides which type of harmful site to detect, such as a malware distribution site or an adult site, based on an instruction from an administrator or the like input via the input device of the determination device 100A. The determination unit 60a selects the structure-specific vectorization model VM(i) and the structure-specific determination model ML(i) of the structure (i) corresponding to the decided site type.

In step S16, the vectorization unit 40a reads the structure-specific URL vectorization model VM(i) selected in step S15 from the storage unit 20 and uses it to vectorize the character string of the structure (i) of the URL to be determined.
In step S17, the determination unit 60a inputs the vector of the structure (i) of the URL to be determined into the structure-specific determination model ML(i) and determines whether the site of the URL to be determined is harmful. The determination unit 60a displays the determination result on the display of the determination device 100A.
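Steps S15 to S17 can be pictured, as a minimal sketch only, with the Python code below. The mapping from site type to structure follows the examples given earlier in this description (malware distribution sites and the query structure, adult sites and the path structure). The lambdas `toy_vm` and `toy_ml` are hypothetical stand-ins so that the sketch runs; real VM(i) and ML(i) would be trained models read from the storage unit 20.

```python
from urllib.parse import urlsplit

# Site type -> URL structure, following the examples in the description.
STRUCTURE_FOR_SITE_TYPE = {"malware": "query", "adult": "path"}


def structure_string(url, structure):
    """Extract the character string of one structure from a URL."""
    parts = urlsplit(url)
    strings = {"host": parts.hostname or "", "path": parts.path, "query": parts.query}
    return strings[structure]


def judge(url, site_type, vectorizers, classifiers):
    """Select VM(i) and ML(i) for the site type, vectorize, then classify."""
    structure = STRUCTURE_FOR_SITE_TYPE[site_type]                      # step S15
    vector = vectorizers[structure](structure_string(url, structure))  # step S16
    return classifiers[structure](vector)                               # step S17


# Toy stand-ins: vectors hold (length, delimiter count); thresholds are arbitrary.
toy_vm = {
    "query": lambda s: [float(len(s)), float(s.count("="))],
    "path": lambda s: [float(len(s)), float(s.count("/"))],
}
toy_ml = {
    "query": lambda v: v[1] >= 3.0,
    "path": lambda v: v[1] >= 5.0,
}

result = judge("http://example.com/dl?a=1&b=2&c=3", "malware", toy_vm, toy_ml)
```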

As described above, in the second embodiment, the determination device 100A executes machine learning using structure-specific URL vectors, obtained by vectorizing the structure-specific character strings extracted from URL character strings based on the contextual features of the URL (for example, query structure, path structure, or host name structure), together with the label assigned to each structure-specific URL vector, and generates the structure-specific determination model ML(i). The determination device 100A then uses the structure-specific vectorization model VM(i) to regard the structure-specific character string of the URL to be determined as one sentence and vectorize it, and inputs the resulting structure-specific URL vector into the structure-specific determination model ML(i). This makes it possible to determine, for example, whether the site of the URL to be determined is a malware distribution site.

In the second embodiment, for each structure representing a contextual feature of the URL, the determination device 100A executes machine learning using, as learning data, the vectors of the URLs in the URL data 210, the harmful URL data 310, and the regular URL data 410 and the label assigned to each vector, thereby automatically learning URL structures that are highly likely to be harmful. As a result, the determination device 100A can accurately determine whether the site of the URL to be determined is harmful without relying on any filtering rules (such as signatures). In addition, by continually repeating learning and verification, the determination device 100A can automatically follow attackers' URLs, which change from day to day, and thus improve responsiveness.

Furthermore, the determination device 100A acquires the URL data 210 from the storage device 200, the harmful URL data 310 from the storage device 300, and the regular URL data 410 from the storage device 400, and then executes the determination process within the determination device 100A. This avoids the leakage of access data that could occur when making inquiries to the outside. Because the determination process is executed within the determination device 100A, the cost of external communication (in both time and money) is also reduced.

[Modification of Second Embodiment]
The determination device 100A according to the second embodiment selects the structure-specific vectorization model VM(i) and the structure-specific determination model ML(i) of a single structure (i) based on an instruction from an administrator or the like input via the input device of the determination device 100A, but the configuration is not limited to this. For example, as a modification of the second embodiment, the determination device 100A may select the structure-specific vectorization models VM(i) and the structure-specific determination models ML(i) of two or more structures (i) based on such an instruction.

In this case, the control unit 10 preferably functions as a vector combining unit (not shown) that, for each harmful URL, each regular URL, and the URL to be determined, combines the vectors of the two or more selected structures.

The learning unit 50a executes supervised machine learning using the label and combined vector of each harmful URL and the label and combined vector of each regular URL, and generates a determination model in which the two or more structures are combined. The determination unit 60a then inputs the combined vector of the URL to be determined into this combined determination model and determines whether the site of the URL to be determined is harmful. This allows the determination device 100A to judge multiple types of harmful sites.
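How the vector combining unit joins the vectors is not specified. One plausible reading, assumed here for illustration only, is simple concatenation in a fixed order of structures. The function name `combine_vectors` and the toy vectors are hypothetical.

```python
def combine_vectors(structure_vectors, selected):
    """Concatenate the vectors of the selected structures in the given order."""
    combined = []
    for name in selected:
        combined.extend(structure_vectors[name])
    return combined


# Toy structure-specific vectors for one URL.
vectors_for_url = {"query": [1.0, 2.0], "path": [3.0], "host": [4.0, 5.0]}
combined = combine_vectors(vectors_for_url, ["query", "path"])
```

The order of the selected structures would need to be the same for the harmful URLs, the regular URLs, and the URL to be determined, so that each position in the combined vector always refers to the same structure.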

[Third Embodiment]
Next, a third embodiment will be described. In the third embodiment, the learning unit that executes machine learning is omitted, and whether the site of the URL to be determined is harmful is determined based on the degree of similarity between the vectors of harmful or regular URLs and the vector of the URL to be determined.
The following description of the third embodiment uses, as an example, the URL vectorization model VM that vectorizes a whole URL as one sentence, but the embodiment is not limited to this. A plurality of structure-specific vectorization models VM(i) that vectorize the URL for each structure based on its contextual features may be used instead.

FIG. 5 is a diagram illustrating an example of the determination device according to the third embodiment. In FIG. 5, elements having the same functions as the elements of the determination device 100 according to the first embodiment are denoted by the same reference numerals, and detailed description is omitted.

The determination device 100B according to the third embodiment functions as a model generation unit 30, a vectorization unit 40, a determination unit 60b, and an association unit 70 when the control unit 10 executes the determination processing program stored in the storage unit 20.

The association unit 70 acquires, for example, the harmful URL data 310 from the storage device 300. The vectorization unit 40 vectorizes each harmful URL in the acquired harmful URL data 310 using the URL vectorization model VM. The association unit 70 then assigns a label indicating harmfulness to each harmful URL vector.
The association unit 70 also acquires the regular URL data 410 from the storage device 400. The vectorization unit 40 vectorizes each regular URL in the acquired regular URL data 410 using the URL vectorization model VM. The association unit 70 then assigns a label indicating regularity to each regular URL vector. Among the plurality of URLs in the URL data 210 acquired from the storage device 200, the association unit 70 may treat each URL not included in the harmful URL data 310 as a regular URL.
The association unit 70 stores the vectors and labels associated for the plurality of harmful URLs and those associated for the plurality of regular URLs in the storage unit 20 as association data TD.

The determination unit 60b reads the association data TD from the storage unit 20 and obtains the vectors and labels of the plurality of harmful URLs and of the plurality of regular URLs. The determination unit 60b then calculates the cosine similarity, Euclidean distance, or the like between the vector of the URL to be determined, as vectorized by the vectorization unit 40, and each of the harmful and regular URL vectors. The determination unit 60b compares the calculated cosine similarity, Euclidean distance, or the like with a predetermined threshold and determines whether the site of the URL to be determined is harmful. The determination unit 60b displays the determination result on the display of the determination device 100B.

When calculating the cosine similarity, Euclidean distance, or the like, the determination unit 60b may process the harmful URL vectors in descending order of the harmfulness indicated by their labels, starting with the vector whose label indicates the highest harmfulness.

The determination unit 60b may also assign a label corresponding to the determination result to the vector of the determined URL and update the association data TD. This allows the determination device 100B to improve its accuracy in determining harmful URLs.

FIG. 6A is a diagram illustrating the generation process in the determination device 100B according to the third embodiment. The process shown in FIG. 6A is executed, for example, when an administrator or the like of the determination device 100B operates the input device of the determination device 100B.

In step S21, the model generation unit 30 extracts the necessary information from the set of URL data 210 stored in the storage device 200, divides each URL into short character strings, regards each URL divided in this way as one sentence, and generates the URL vectorization model VM (vectorization model VM). The model generation unit 30 stores the generated URL vectorization model VM in the storage unit 20.

In step S22, the vectorization unit 40 acquires the harmful URL data 310 from the storage device 300 and vectorizes each harmful URL in the acquired data using the URL vectorization model VM generated in step S21. The vectorization unit 40 also acquires the regular URL data 410 from the storage device 400 and vectorizes each regular URL in the acquired data using the URL vectorization model VM.

In step S23, the association unit 70 assigns a label indicating harmfulness to each harmful URL vector and a label indicating regularity to each regular URL vector. The association unit 70 then stores the vectors and labels of the plurality of harmful URLs and of the plurality of regular URLs in the storage unit 20 as association data TD.
Note that the generation of the URL vectorization model VM in step S21 may be executed separately from the generation of the association data TD in steps S22 and S23.

FIG. 6B is a diagram illustrating the determination process in the determination device 100B according to the third embodiment. The process shown in FIG. 6B is executed, for example, when an administrator or the like of the determination device 100B operates the input device of the determination device 100B.
In step S24, the determination unit 60b acquires the URL to be determined from, for example, the URL data 210 in the storage device 200.
In step S25, the vectorization unit 40 reads the URL vectorization model VM from the storage unit 20 and uses it to vectorize the character string of the URL to be determined acquired in step S24.
In step S26, the determination unit 60b reads the association data TD from the storage unit 20 and obtains the vectors and labels of the plurality of harmful URLs and of the plurality of regular URLs. The determination unit 60b then calculates the cosine similarity, Euclidean distance, or the like between the vector of the URL to be determined, vectorized in step S25, and each of the harmful and regular URL vectors.

For example, when there is a harmful URL vector whose cosine similarity to the vector of the URL to be determined is equal to or greater than the threshold, the determination unit 60b determines that the site of the URL to be determined is harmful. When there is no such harmful URL vector, the determination unit 60b determines that the site is regular. Alternatively, when there is a harmful URL vector whose Euclidean distance from the vector of the URL to be determined is equal to or less than the threshold, the determination unit 60b determines that the site of the URL to be determined is harmful. When there is no such harmful URL vector, the determination unit 60b determines that the site is regular. The determination unit 60b then displays the determination result on the display of the determination device 100B.
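The threshold rules above can be sketched as follows. The layout of the association data TD as a list of (vector, label) pairs, the label strings, the function names, and the threshold values are assumptions made for this illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_harmful_by_cosine(target, association_data, threshold):
    """Harmful if any harmful vector has cosine similarity >= threshold."""
    return any(
        label == "harmful" and cosine_similarity(target, vec) >= threshold
        for vec, label in association_data
    )


def is_harmful_by_distance(target, association_data, threshold):
    """Harmful if any harmful vector lies within Euclidean distance <= threshold."""
    return any(
        label == "harmful" and math.dist(target, vec) <= threshold
        for vec, label in association_data
    )


# Toy association data TD: (vector, label) pairs.
td = [([1.0, 0.0], "harmful"), ([0.0, 1.0], "regular")]
near_harmful = is_harmful_by_cosine([0.9, 0.1], td, threshold=0.9)
```

In both rules only the harmful URL vectors decide the outcome; a URL that is not sufficiently close to any of them is treated as regular.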

As described above, in the third embodiment, the determination device 100B calculates the degree of similarity between each of the harmful and regular URL vectors and the vector of the URL to be determined, and determines whether the site of the URL to be determined is harmful. As a result, the determination device 100B can accurately determine whether the site of the URL to be determined is harmful without relying on any filtering rules (such as signatures). In addition, by continually updating the URL vectorization model VM and the association data TD based on the determination results, the determination device 100B can automatically follow attackers' URLs, which change from day to day, and thus improve responsiveness.

Furthermore, the determination device 100B acquires the URL data 210 from the storage device 200, the harmful URL data 310 from the storage device 300, and the regular URL data 410 from the storage device 400, and then executes the determination process within the determination device 100B. This avoids the leakage of access data that could occur when making inquiries to the outside. Because the determination process is executed within the determination device 100B, the cost of external communication (in both time and money) is also reduced.

Embodiments of the present invention have been described above, but the present invention is not limited to these embodiments. The effects described in the embodiments merely list the most preferable effects arising from the present invention, and the effects of the present invention are not limited to those described in the embodiments.

[Modification 1 of Embodiment]
In the determination device 100 according to the first embodiment, the vectorization unit 40, the learning unit 50, and the determination unit 60 are arranged within the determination device 100, but they may instead be distributed across external devices, including, for example, the cloud. Likewise, in the determination device 100A according to the second embodiment, the vectorization unit 40a, the learning unit 50a, and the determination unit 60a may be distributed across external devices including the cloud. In the determination device 100B according to the third embodiment, the vectorization unit 40, the determination unit 60b, and the association unit 70 may be distributed across external devices including the cloud.
For example, the determination device 100 may have the URL vectorization model VM generated by accessing a separate model generation device, such as one in the cloud.

[Modification 2 of Embodiment]
The determination device 100 according to the first embodiment and the determination device 100B according to the third embodiment generate the URL vectorization model VM using the URL data 210 of the storage device 200. Alternatively, the URL vectorization model VM may be generated in advance by an external computer and stored in the storage device 200 or the like. In this case, the determination device 100 and the determination device 100B acquire the URL vectorization model VM from the storage device 200 and store it in the storage unit 20.
Similarly, in the second embodiment, the structure-specific URL vectorization models VM(1) to VM(K) may be generated in advance by an external computer and stored in the storage device 200 or the like. In that case, the determination device 100A may acquire the structure-specific URL vectorization models VM(1) to VM(K) from the storage device 200.
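As a rough illustration of what "structure-specific" means here, the sketch below splits a URL into the host name, path part, and query part (the structures named in the claims) and connects per-structure vectors into a single vector. It uses Python's standard `urllib.parse.urlsplit`; the function names, the fixed order host/path/query, and the toy vectors are assumptions made for illustration, and the vectorization models VM(1) to VM(K) themselves are not reproduced.

```python
from typing import Dict, List
from urllib.parse import urlsplit

# Assumed structure order for connecting vectors (hypothetical choice).
STRUCTURES = ("host", "path", "query")


def split_url_by_structure(url: str) -> Dict[str, str]:
    """Split a URL into the short character strings for each structure."""
    parts = urlsplit(url)
    return {
        "host": parts.hostname or "",
        "path": parts.path,
        "query": parts.query,
    }


def connect_vectors(per_structure: Dict[str, List[float]]) -> List[float]:
    """Connect the per-structure vectors in a fixed order into one vector."""
    connected: List[float] = []
    for name in STRUCTURES:
        connected.extend(per_structure[name])
    return connected


short_strings = split_url_by_structure("http://example.com/login/index.php?id=1&x=2")
print(short_strings)

# Toy vectors standing in for the output of the structure-specific models.
print(connect_vectors({"host": [1.0], "path": [2.0, 3.0], "query": [4.0]}))
```

Each short character string would then be passed to the model for its structure, whether that model was generated locally or obtained from the storage device 200.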

Reference Signs List
10 control unit
20 storage unit
30 model generation unit
40 vectorization unit
50 learning unit
60 determination unit
100 determination device

Claims

A determination device comprising:
a vectorization unit that vectorizes a character string of an arbitrary URL using a URL vectorization model that regards a character string included in a URL as one sentence and vectorizes it; and
a determination unit that determines whether the site indicated by a URL to be determined is harmful, using vectors obtained by the vectorization unit vectorizing the character strings of a plurality of URLs each previously determined to be harmful or not, and a vector obtained by the vectorization unit vectorizing the character string of the URL to be determined.

The determination device according to claim 1, further comprising a learning unit that generates a determination model for determining whether a site indicated by an arbitrary URL is harmful, by performing machine learning using, as teacher data, labels indicating the determination result as to whether each of the plurality of URLs is harmful and the vectors obtained by the vectorization unit vectorizing the character strings of the plurality of URLs,
wherein the determination unit determines whether the site indicated by the URL to be determined is harmful using the determination model generated by the learning unit.

The determination device according to claim 1, further comprising an association unit that generates association data by associating labels indicating the determination result as to whether each of the plurality of URLs is harmful with the vectors obtained by vectorizing the character strings of the plurality of URLs,
wherein the determination unit determines whether the site indicated by the URL to be determined is harmful using the generated association data.

The determination device according to claim 1, wherein
the URL vectorization model is a vectorization model that divides a URL character string by structure into at least a query part, a path part, and a host name, regards each short character string thus generated as one sentence, and generates from the URL character string a vector of the short character string for each structure,
the vectorization unit vectorizes a character string of an arbitrary URL for each structure using the URL vectorization model, and
the determination unit determines whether the site indicated by the URL to be determined is harmful, using the vectors generated for each structure by the vectorization unit from the character strings of the plurality of URLs each previously determined to be harmful or not, and the vectors generated for each structure by the vectorization unit from the character string of the URL to be determined.

The determination device according to claim 4, wherein the determination unit selects the structure of the URL according to a type of a harmful site.

The determination device according to claim 4, wherein
the URL vectorization model is further a vectorization model that generates a connected vector by connecting the vectors generated for each structure from a URL character string,
the vectorization unit further uses the URL vectorization model to generate a connected vector by connecting the vectors generated for each structure from a character string of an arbitrary URL, and
the determination unit further determines whether the site indicated by the URL to be determined is harmful, using the connected vectors generated by the vectorization unit from the character strings of the plurality of URLs each previously determined to be harmful or not, and the connected vector generated by the vectorization unit from the character string of the URL to be determined.

A determination method implemented by a computer, comprising:
a vectorization step of vectorizing an arbitrary character string using a URL vectorization model that regards a character string included in a URL as one sentence and vectorizes it; and
a determination step of determining whether the site indicated by a URL to be determined is harmful, using vectors obtained in the vectorization step by vectorizing the character strings of a plurality of URLs each previously determined to be harmful or not, and a vector obtained in the vectorization step by vectorizing the character string of the URL to be determined.