JP7435744B2

JP7435744B2 - Identification method, identification device and identification program

Info

Publication number: JP7435744B2
Application number: JP2022510295A
Authority: JP
Inventors: 駿飛山; 博胡; 和憲神谷
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2020-03-26
Filing date: 2020-03-26
Publication date: 2024-02-21
Anticipated expiration: 2040-03-26
Also published as: US20230136929A1; WO2021192186A1; JPWO2021192186A1

Description

Application of Article 30, Paragraph 2 of the Patent Act: IEICE Technical Report, Information and Communication Management (ICM), vol. 119, No. 438, ICM2019-51, pp. 55-60, published February 24, 2020.

The present invention relates to an identification method, an identification device, and an identification program.

When a classifier for application identification is created by supervised learning, a large amount of data and a label for each data point are required. Conventional techniques include adding labels to flow data using packet data and extracting features using packet data.

T. Karagiannis, K. Papagiannaki and M. Faloutsos, “BLINC: Multilevel Traffic Classification in the Dark”, Proceedings of the ACM SIGCOMM 2005 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications, Philadelphia, Pennsylvania, USA, August 22-26, 2005.
Z. Chen, K. He, J. Li and Y. Geng, “Seq2Img: A Sequence-to-Image based Approach Towards IP Traffic Classification using Convolutional Neural Networks”, 2017 IEEE International Conference on Big Data (Big Data).

However, when flow data is used to add application-level labels, the flow data contains only simple information such as IP addresses and port numbers, so labeling is difficult and its accuracy is low. When packet data is used, the load of collection and analysis grows with the size of the target network, so application-level labeling is difficult and the approach is hard to apply to large-scale networks.

The present invention has been made in view of the above, and its object is to provide an identification method, an identification device, and an identification program that can appropriately identify the application that generated traffic even in a large-scale network.

In order to solve the above problems and achieve the object, an identification method according to the present invention is an identification method executed by an identification device that identifies applications, and includes: a collection step of collecting packet data and first flow data that satisfy a predetermined rule; a signature generation step of analyzing the packet data to generate a signature that associates an application with an IP address; a flow data generation step of generating second flow data from the packet data; a calculation step of calculating first feature information, which is a statistical feature for each IP address, for the first flow data, and calculating second feature information, which is a statistical feature for each IP address, for the second flow data; an addition step of adding a label to the second feature information using the signature; and a learning step of causing a classifier to learn application identification using the first feature information and the second feature information as learning data.

Further, an identification device according to the present invention is an identification device that identifies applications, and includes: a collection unit that collects packet data and first flow data that satisfy a predetermined rule; a signature generation unit that analyzes the packet data to generate a signature that associates an application with an IP address; a flow data generation unit that generates second flow data from the packet data; a feature calculation unit that calculates first feature information, which is a statistical feature for each IP address, for the first flow data, and calculates second feature information, which is a statistical feature for each IP address, for the second flow data; a label addition unit that adds a label to the second feature information using the signature; and a learning unit that causes a classifier to learn application identification using the first feature information and the second feature information as learning data.

Further, an identification program according to the present invention causes a computer to execute: a collection step of collecting packet data and first flow data that satisfy a predetermined rule; a first generation step of analyzing the packet data to generate a signature that associates an application with an IP address; a second generation step of generating second flow data from the packet data; a calculation step of calculating first feature information, which is a statistical feature for each IP address, for the first flow data, and calculating second feature information, which is a statistical feature for each IP address, for the second flow data; an addition step of adding a label to the second feature information using the signature; and a learning step of causing a classifier to learn application identification using the first feature information and the second feature information as learning data.

According to the present invention, in a data search including spatio-temporal data, the application that generated traffic can be appropriately identified even in a large-scale network.

FIG. 1 is a block diagram showing an example of the configuration of a communication system in an embodiment.
FIG. 2 is a flowchart showing the processing procedure of the learning process according to the embodiment.
FIG. 3 is a flowchart showing the processing procedure of the identification process according to the embodiment.
FIG. 4 is a diagram illustrating an application example of the identification device according to the embodiment.
FIG. 5 is a diagram illustrating another application example of the identification device 10 according to the embodiment.
FIG. 6 is a diagram showing an example of a computer that implements the identification device by executing a program.

Hereinafter, one embodiment of the present invention will be described in detail with reference to the drawings. Note that the present invention is not limited by this embodiment. In the drawings, the same parts are denoted by the same reference numerals.

[Embodiment]
FIG. 1 is a block diagram showing an example of the configuration of a communication system in an embodiment. As shown in FIG. 1, the communication system according to the embodiment includes small-scale network (NW) devices 2A and 2B, identification target NW routers 3A and 3B, and an identification device 10. The small-scale NW devices 2A and 2B, the identification target NW routers 3A and 3B, and the identification device 10 communicate via a network. Although FIG. 1 shows a plurality of small-scale NW devices 2A, 2B and a plurality of identification target NW routers 3A, 3B, there may be only one of each.

The small-scale NW devices 2A and 2B transmit traffic data of the small-scale NW to the identification device 10, for example by mirroring traffic in the small-scale NW. The small-scale NW devices 2A and 2B transmit packet data D1 of the small-scale NW to the identification device 10.

The identification target NW routers 3A and 3B are routers provided in the NW in which applications are to be identified (the identification target NW). Using a flow collection function or the like, they collect network flow data (flow data) D2 of the identification target NW and transmit it to the identification device 10.

The identification device 10 identifies, from flow data in the identification target NW, the application (for example, a web application) that generated the traffic. The identification device 10 first pre-trains the classifier to identify applications using labeled learning data generated from small-scale NW data, and then uses Domain Adaptation so that the unlabeled flow data of the identification target NW can also be used for learning. In this way, the identification device 10 builds a classifier that can identify applications even from flow data in a large-scale identification target NW.

[Identification device]
Next, the identification device 10 will be described with reference to FIG. 1. As shown in FIG. 1, the identification device 10 includes a collection unit 11, a signature generation unit 12, a flow data generation unit 13, a signature database (DB) 14, a feature calculation unit 15, a label addition unit 16, a classifier learning unit 17 (learning unit), a trained classifier 18, an application identification unit 19 (identification unit), and an output unit 20.

Note that the identification device 10 is realized, for example, by loading a predetermined program into a computer or the like that includes a ROM (Read Only Memory), a RAM (Random Access Memory), a CPU (Central Processing Unit), and so on, and having the CPU execute the program. The identification device 10 also has a communication interface for transmitting and receiving various information to and from other devices connected via a network or the like. For example, the identification device 10 has a NIC (Network Interface Card) or the like and communicates with other devices via a telecommunications line such as a LAN (Local Area Network) or the Internet.

The collection unit 11 collects packet data and flow data that satisfy a predetermined rule. During learning, the collection unit 11 collects the packet data D1 of the small-scale NW transmitted from the small-scale NW devices 2A and 2B, and the flow data D2 (first flow data) of the identification target NW, which is a large-scale NW, transmitted from the identification target NW routers 3A and 3B. The packet data D1 comes from a NW that is small enough in scale for labels to be added by the downstream processing.

During learning, the collection unit 11 outputs the packet data D1 of the small-scale NW to the signature generation unit 12 and the flow data generation unit 13, and outputs the first flow data to the feature calculation unit 15. During identification, the collection unit 11 collects the flow data of the identification target NW that is to be identified, and outputs it to the feature calculation unit 15.

The signature generation unit 12 analyzes the packet data D1 of the small-scale NW and generates a signature that associates an application with an IP address. Specifically, the signature generation unit 12 analyzes the packet data collected in the small-scale NW with a DPI device or the like, and creates a signature that associates a label indicating the category of the application that generated the packet data (for example, the name of the application) with a set consisting of the source IP address, the destination IP address, the port number, and the time at which the packet was recorded.

The flow data generation unit 13 generates second flow data from the packet data D1 of the small-scale NW.
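As a rough illustration of how second flow data might be derived from the packet data D1, the sketch below groups packets by 5-tuple and sums packet and byte counts. The `Packet` fields, the choice of the 5-tuple as flow key, and the absence of flow timeouts are all assumptions made for illustration; the source does not specify the flow record format.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    # Hypothetical minimal packet record; field names are assumptions.
    ts: float      # capture time in seconds
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    proto: str
    length: int    # packet size in bytes


def packets_to_flows(packets):
    """Aggregate packets into flow records keyed by 5-tuple."""
    acc = defaultdict(lambda: {"packets": 0, "bytes": 0, "start": None, "end": None})
    for p in packets:
        key = (p.src_ip, p.dst_ip, p.src_port, p.dst_port, p.proto)
        f = acc[key]
        f["packets"] += 1
        f["bytes"] += p.length
        f["start"] = p.ts if f["start"] is None else min(f["start"], p.ts)
        f["end"] = p.ts if f["end"] is None else max(f["end"], p.ts)
    return [
        {"src_ip": k[0], "dst_ip": k[1], "src_port": k[2],
         "dst_port": k[3], "proto": k[4], **v}
        for k, v in acc.items()
    ]
```

A real flow exporter would also split long-lived flows on idle or active timeouts; that is omitted here to keep the sketch short.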

The signature DB 14 stores, in association with each other, the label indicating the application category generated by the signature generation unit 12 and the set consisting of the source IP address, the destination IP address, the port number, and the time at which the packet was recorded.
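The signature record described above can be pictured as a small keyed store. The following is a minimal in-memory sketch, not the device's actual storage design: the class and method names (`Signature`, `SignatureDB`, `labels_for`), the use of epoch seconds, and indexing by both endpoint IP addresses are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signature:
    label: str          # application category, e.g. the application name
    src_ip: str
    dst_ip: str
    port: int
    recorded_at: float  # time the packet was recorded (epoch seconds assumed)


class SignatureDB:
    """In-memory stand-in for the signature DB 14 (hypothetical API)."""

    def __init__(self):
        self._by_ip = {}

    def add(self, sig):
        # Index by both endpoints so a lookup by either IP finds the record.
        for ip in {sig.src_ip, sig.dst_ip}:
            self._by_ip.setdefault(ip, []).append(sig)

    def labels_for(self, ip, start=float("-inf"), end=float("inf")):
        """Return labels recorded for `ip` with start <= recorded_at < end."""
        return {
            s.label
            for s in self._by_ip.get(ip, [])
            if start <= s.recorded_at < end
        }
```

Keeping the recording time in each record lets a lookup be restricted to the same time window as the flow data being labeled.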

During learning, the feature calculation unit 15 calculates first feature information, which is a statistical feature for each IP address, for the first flow data, that is, the flow data D2 of the identification target NW. Also during learning, the feature calculation unit 15 calculates second feature information, which is a statistical feature for each IP address, for the second flow data generated by the flow data generation unit 13 from the packet data D1 of the small-scale NW. During identification, the feature calculation unit 15 calculates identification feature information, which is a statistical feature for each IP address, for the flow data of the identification target NW that is to be identified.

From the set of flow data per 24 hours that has a given IP address as its source and/or destination, the feature calculation unit 15 calculates at least one of a histogram of the number of packets, a histogram of the number of bytes, and a histogram of the numbers of bytes and packets. Specifically, for the first flow data, the feature calculation unit 15 calculates statistics such as the average number of bytes per packet for each destination IP address and source IP address, and extracts them as the first feature information. Likewise, for the second flow data, the feature calculation unit 15 calculates statistics such as the average number of bytes per packet for each destination IP address and source IP address, and extracts them as the second feature information.
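The per-IP statistics described above could be computed along the following lines. This is a sketch under stated assumptions: the flow records are plain dicts with `src_ip`, `dst_ip`, `packets`, and `bytes` keys, the input is already limited to one 24-hour window, and the log2 bin layout, bin counts, and normalization by flow count are illustrative choices that the source does not specify.

```python
import math
from collections import defaultdict

PKT_BINS = 8    # assumed number of log2 bins for packets per flow
BYTE_BINS = 16  # assumed number of log2 bins for bytes per flow


def log2_bin(value, n_bins):
    """Map a count to a log2 bucket, clamped to [0, n_bins - 1]."""
    if value < 1:
        return 0
    return min(int(math.log2(value)), n_bins - 1)


def per_ip_features(flows):
    """Return {ip: feature vector} for flows from one 24-hour window."""
    pkt_hist = defaultdict(lambda: [0] * PKT_BINS)
    byte_hist = defaultdict(lambda: [0] * BYTE_BINS)
    totals = defaultdict(lambda: [0, 0, 0])  # flows, packets, bytes
    for f in flows:
        # Count the flow once for each distinct endpoint IP.
        for ip in {f["src_ip"], f["dst_ip"]}:
            pkt_hist[ip][log2_bin(f["packets"], PKT_BINS)] += 1
            byte_hist[ip][log2_bin(f["bytes"], BYTE_BINS)] += 1
            t = totals[ip]
            t[0] += 1
            t[1] += f["packets"]
            t[2] += f["bytes"]
    features = {}
    for ip, (n_flows, n_pkts, n_bytes) in totals.items():
        mean_bpp = n_bytes / n_pkts if n_pkts else 0.0
        features[ip] = (
            [c / n_flows for c in pkt_hist[ip]]      # packet-count histogram
            + [c / n_flows for c in byte_hist[ip]]   # byte-count histogram
            + [mean_bpp]                             # mean bytes per packet
        )
    return features
```

The same function would be applied to the first flow data, the second flow data, and the flow data to be identified, so that all three share one feature layout.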

During learning, the label addition unit 16 adds labels to the second feature information using the signatures generated by the signature generation unit 12.
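One way to picture the labeling step is a join between per-IP feature vectors and signatures on the IP address. The sketch below assumes signatures are `(label, src_ip, dst_ip, port, recorded_at)` tuples, that a signature applies to both of its endpoint IP addresses, and that conflicts are resolved by majority vote; none of these details are given in the source.

```python
from collections import Counter


def label_features(features, signatures):
    """Split {ip: vector} into labeled and unlabeled lists.

    signatures: iterable of (label, src_ip, dst_ip, port, recorded_at).
    """
    votes = {}
    for label, src_ip, dst_ip, _port, _recorded_at in signatures:
        for ip in {src_ip, dst_ip}:
            votes.setdefault(ip, Counter())[label] += 1

    labeled, unlabeled = [], []
    for ip, vec in features.items():
        if ip in votes:
            # Majority label among the signatures that mention this IP.
            labeled.append((ip, vec, votes[ip].most_common(1)[0][0]))
        else:
            unlabeled.append((ip, vec))
    return labeled, unlabeled
```

IP addresses with no matching signature are returned separately so that they can still be used as unlabeled data.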

The classifier learning unit 17 causes the classifier to learn application identification using the first feature information and the second feature information as learning data. The classifier learning unit 17 first pre-trains the classifier using the labeled second feature information generated by the label addition unit 16. The classifier learning unit 17 then trains the classifier by a domain adaptation technique using the first feature information and the unlabeled second feature information. That is, the classifier learning unit 17 trains the classifier by Domain Adaptation using the classifier obtained by pre-training, the first feature information, and the unlabeled second feature information.
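The two-stage training can be illustrated with a deliberately simple stand-in. The source does not say which Domain Adaptation algorithm or classifier model is used; the sketch below swaps in a nearest-centroid classifier and per-domain feature standardization (aligning the mean and variance of the small-scale NW domain and the identification target NW domain) purely to show the data flow: labeled second feature information for pre-training, then unlabeled second and first feature information for adaptation. All function names are hypothetical.

```python
import math


def mean_std(rows):
    """Per-dimension mean and standard deviation (std of 0 replaced by 1)."""
    n, d = len(rows), len(rows[0])
    mu = [sum(r[j] for r in rows) / n for j in range(d)]
    sd = [math.sqrt(sum((r[j] - mu[j]) ** 2 for r in rows) / n) or 1.0
          for j in range(d)]
    return mu, sd


def standardize(x, mu, sd):
    return [(v - m) / s for v, m, s in zip(x, mu, sd)]


def pretrain_centroids(labeled):
    """Pre-training: one centroid per label from (vector, label) pairs."""
    groups = {}
    for x, y in labeled:
        groups.setdefault(y, []).append(x)
    return {y: mean_std(xs)[0] for y, xs in groups.items()}


def adapt(centroids, source_unlabeled, target_unlabeled):
    """Adaptation: align both domains by their own mean/std statistics."""
    mu_s, sd_s = mean_std(source_unlabeled)
    mu_t, sd_t = mean_std(target_unlabeled)
    aligned = {y: standardize(c, mu_s, sd_s) for y, c in centroids.items()}

    def predict(x):
        # Standardize a target-domain vector with target statistics,
        # then pick the nearest aligned centroid.
        z = standardize(x, mu_t, sd_t)
        return min(aligned, key=lambda y: sum((a - b) ** 2
                                              for a, b in zip(aligned[y], z)))
    return predict
```

In practice an adversarial or discrepancy-based method on a neural classifier would be a more typical choice; the point here is only that the adaptation stage needs no labels from the identification target NW.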

The trained classifier 18 is a classifier that, through the pre-training and training performed by the classifier learning unit 17, has become able to identify the application corresponding to the IP address of the flow data to be identified. Specifically, the trained classifier 18 takes the feature information of the flow data to be identified as input, and outputs the probability that the IP address of that flow data provides each application.

The application identification unit 19 uses the trained classifier 18 to identify the application corresponding to the IP address of the flow data to be identified. During identification, the application identification unit 19 inputs the identification feature information to the trained classifier 18 and, based on the identification result output from the trained classifier 18, identifies the application corresponding to the IP address of the flow data to be identified. The output unit 20 outputs the identification result obtained by the application identification unit 19 to, for example, an external device.
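Since the trained classifier outputs a probability for each application, the identification step amounts to choosing the most probable application per IP address. The sketch below assumes the classifier exposes raw per-application scores that are turned into probabilities with a softmax; the score format and the function names are assumptions, not part of the source.

```python
import math


def softmax(scores):
    """Convert {app: score} into {app: probability}."""
    m = max(scores.values())
    exps = {app: math.exp(s - m) for app, s in scores.items()}
    total = sum(exps.values())
    return {app: e / total for app, e in exps.items()}


def identify(scores_by_ip):
    """Return {ip: (most probable app, {app: probability})}."""
    results = {}
    for ip, scores in scores_by_ip.items():
        probs = softmax(scores)
        best = max(probs, key=probs.get)
        results[ip] = (best, probs)
    return results
```

Keeping the full probability table alongside the chosen application lets the output unit report either form.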

[Learning process]
Next, the learning process for the classifier executed by the identification device 10 shown in FIG. 1 will be described. FIG. 2 is a flowchart showing the processing procedure of the learning process according to the embodiment.

As shown in FIG. 2, the collection unit 11 performs a collection process of collecting the packet data D1 of the small-scale NW and the flow data D2 (first flow data) of the identification target NW (step S1).

The signature generation unit 12 then analyzes the packet data D1 of the small-scale NW and generates a signature that associates an application with an IP address (step S2). The flow data generation unit 13 generates second flow data from the packet data D1 of the small-scale NW (step S3).

The feature calculation unit 15 calculates second feature information, which is a statistical feature for each IP address, for the second flow data (step S4). The label addition unit 16 adds labels to the second feature information using the signatures generated by the signature generation unit 12 (step S5). The classifier learning unit 17 pre-trains the classifier using the labeled second feature information generated by the label addition unit 16 (step S6).

The feature calculation unit 15 also calculates first feature information, which is a statistical feature for each IP address, for the first flow data (step S7). The classifier learning unit 17 trains the classifier by Domain Adaptation using the classifier obtained by pre-training, the first feature information, and the unlabeled second feature information (step S8). The classifier learning unit 17 thereby generates the trained classifier 18.

[Identification process]
Next, the identification process executed by the identification device 10 shown in FIG. 1, which identifies the application corresponding to the IP address of flow data of the identification target NW, will be described. FIG. 3 is a flowchart showing the processing procedure of the identification process according to the embodiment.

As shown in FIG. 3, during identification, the collection unit 11 collects flow data of the identification target NW, which is the large-scale NW to be identified (step S11). The feature calculation unit 15 then calculates identification feature information, which is a statistical feature for each IP address, for the flow data of the identification target NW (step S12).

The application identification unit 19 uses the trained classifier 18 to identify the application corresponding to the IP address of the flow data to be identified (step S13). The output unit 20 outputs the identification result obtained by the application identification unit 19 to, for example, an external device (step S14).

[Application example 1]
An application example of the identification device 10 will be described. FIG. 4 is a diagram illustrating an application example of the identification device 10 according to the embodiment.

As shown in FIG. 4, network flow data collected in an ISP NW is identified by the identification device 10, and the probability that each IP address in the ISP NW flow data provides each application is visualized as the identification result. This allows the network administrator to grasp the NW status in detail and to identify the routes (for example, routes R1 and R2) that should be prioritized for investment. Applying the identification device 10 in this way makes NW monitoring and capital investment planning more efficient through traffic visualization of the ISP network.

[Application example 2]
FIG. 5 is a diagram illustrating another application example of the identification device 10 according to the embodiment. As shown in FIG. 5, the identification device 10 is applied when detecting malicious communication that makes up only a very small portion of large-scale traffic data Dt.

Specifically, the identification process of the identification device 10 is performed on the large-scale traffic data Dt, and normal traffic is excluded from the large-scale traffic data Dt in advance, which reduces the amount of traffic data Dm that must be investigated. Applying the identification device 10 in this way enables screening for malicious communication detection and reduces the burden of that detection.
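The screening step can be sketched as a filter over flow records. The sketch assumes identification results of the form `{ip: (app, probability)}`, a caller-supplied set of applications regarded as normal, a confidence threshold of 0.9, and that a flow is excluded when either endpoint is confidently identified as normal; all of these are illustrative assumptions rather than details from the source.

```python
def screen_flows(flows, predictions, normal_apps, min_prob=0.9):
    """Return the flows that still need investigation (Dm).

    flows: list of dicts with "src_ip" and "dst_ip" keys.
    predictions: {ip: (app, probability)} from the identification process.
    """
    def is_normal(ip):
        pred = predictions.get(ip)
        return (pred is not None
                and pred[0] in normal_apps
                and pred[1] >= min_prob)

    return [
        f for f in flows
        if not (is_normal(f["src_ip"]) or is_normal(f["dst_ip"]))
    ]
```

Flows whose endpoints are unknown to the classifier, or identified only with low confidence, are kept for investigation.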

[Effects of embodiment]
As described above, the identification device 10 according to the present embodiment trains the classifier using labeled learning data generated from small-scale NW data, and then uses a domain adaptation technique to train it on the unlabeled flow data of the identification target NW, which is a large-scale NW, and on the unlabeled small-scale NW data.

As a result, by using Domain Adaptation to include the unlabeled flow data of the identification target NW in learning, the identification device 10 can build a classifier that identifies data of the identification target NW more accurately than a classifier trained only on labeled learning data generated from small-scale NW data.

As described above, the identification device 10 can identify the application that generated traffic not only for small-scale NW data but also for flow data of large-scale NWs, which has been difficult to label until now, so application-level traffic identification becomes possible even in large-scale NWs.

[System configuration, etc.]
Each component of each illustrated device is functional and conceptual, and does not necessarily need to be physically configured as illustrated. That is, the specific form of distribution and integration of the devices is not limited to what is illustrated, and all or part of them can be functionally or physically distributed or integrated in arbitrary units according to various loads, usage conditions, and the like. Furthermore, all or any part of the processing functions performed by each device can be realized by a CPU and a program analyzed and executed by that CPU, or can be realized as hardware using wired logic.

Among the processes described in this embodiment, all or part of the processes described as being performed automatically can also be performed manually, and all or part of the processes described as being performed manually can also be performed automatically by known methods. In addition, the processing procedures, control procedures, specific names, and information including various data and parameters shown in the above description and drawings can be changed arbitrarily unless otherwise specified.

[Program]
FIG. 6 is a diagram showing an example of a computer that implements the identification device 10 by executing a program. The computer 1000 includes, for example, a memory 1010 and a CPU 1020. The computer 1000 also includes a hard disk drive interface 1030, a disk drive interface 1040, a serial port interface 1050, a video adapter 1060, and a network interface 1070. These units are connected by a bus 1080.

The memory 1010 includes a ROM (Read Only Memory) 1011 and a RAM 1012. The ROM 1011 stores, for example, a boot program such as a BIOS (Basic Input Output System). The hard disk drive interface 1030 is connected to a hard disk drive 1090. The disk drive interface 1040 is connected to a disk drive 1100. A removable storage medium such as a magnetic disk or an optical disk is inserted into the disk drive 1100. The serial port interface 1050 is connected to, for example, a mouse 1110 and a keyboard 1120. The video adapter 1060 is connected to, for example, a display 1130.

The hard disk drive 1090 stores, for example, an OS (Operating System) 1091, an application program 1092, a program module 1093, and program data 1094. That is, the program that defines each process of the identification device 10 is implemented as the program module 1093, in which computer-executable code is written. The program module 1093 is stored in, for example, the hard disk drive 1090. For example, the program module 1093 for executing processing equivalent to the functional configuration of the identification device 10 is stored in the hard disk drive 1090. Note that the hard disk drive 1090 may be replaced by an SSD (Solid State Drive).

The setting data used in the processing of the embodiment described above is stored as the program data 1094 in, for example, the memory 1010 or the hard disk drive 1090. The CPU 1020 then reads the program module 1093 and the program data 1094 stored in the memory 1010 or the hard disk drive 1090 into the RAM 1012 as necessary and executes them.

Note that the program module 1093 and the program data 1094 are not limited to being stored in the hard disk drive 1090. They may be stored in, for example, a removable storage medium and read by the CPU 1020 via the disk drive 1100 or the like. Alternatively, the program module 1093 and the program data 1094 may be stored in another computer connected via a network (a LAN, a WAN (Wide Area Network), etc.) and read by the CPU 1020 from that computer via the network interface 1070.

Although an embodiment to which the invention made by the present inventors is applied has been described above, the present invention is not limited by the description and drawings that form part of the disclosure of the present invention according to this embodiment. That is, all other embodiments, examples, operational techniques, and the like made by those skilled in the art based on this embodiment are included in the scope of the present invention.

2A, 2B Small-scale network (NW) device
3A, 3B NW router to be identified
10 Identification device
11 Collection unit
12 Signature generation unit
13 Flow data generation unit
14 Signature database (DB)
15 Feature calculation unit
16 Label adding unit
17 Classifier learning unit
18 Trained classifier
19 Application identification unit
20 Output unit

Claims

An identification method performed by an identification device for identifying an application, the method comprising:
a collection step of collecting packet data from a small-scale network that satisfies a predetermined rule and first flow data of a large-scale network to be identified;
a signature generation step of analyzing the packet data from the small-scale network to generate a signature that associates an application with an IP address;
a flow data generation step of generating second flow data from the packet data from the small-scale network;
a calculation step of calculating, for the first flow data of the large-scale network, first feature information that is a statistical feature for each IP address, and calculating, for the second flow data, second feature information that is a statistical feature for each IP address;
an addition step of adding a label to the second feature information using the signature; and
a learning step of causing a classifier to learn identification of the application using the first feature information and the second feature information as learning data,
wherein, in the learning step, the classifier is pre-trained using the labeled second feature information as learning data, and is then trained by a domain adaptation technique using the unlabeled first feature information and the unlabeled second feature information.
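The addition step and the two-phase learning step can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the claim does not specify a classifier or a particular adaptation technique, so the sketch substitutes a nearest-centroid classifier for the pre-trained classifier and a simple first-moment (mean) alignment for the adaptation phase, chosen only because both fit in a few lines. The function names, the dictionary layout, and the IP addresses are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Vector = List[float]


def add_labels(features: Dict[str, Vector],
               signature: Dict[str, str]) -> List[Tuple[Vector, str]]:
    """Addition step: label each per-IP feature vector whose IP is in the signature."""
    return [(vec, signature[ip]) for ip, vec in features.items() if ip in signature]


def mean(vectors: List[Vector]) -> Vector:
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def pretrain(labeled: List[Tuple[Vector, str]]) -> Dict[str, Vector]:
    """Pre-training stand-in: one centroid per application from labeled features."""
    by_app = defaultdict(list)
    for vec, app in labeled:
        by_app[app].append(vec)
    return {app: mean(vecs) for app, vecs in by_app.items()}


def fit_shift(source_unlabeled: List[Vector],
              target_unlabeled: List[Vector]) -> Vector:
    """Adaptation stand-in: offset that moves the target mean onto the source mean."""
    src, tgt = mean(source_unlabeled), mean(target_unlabeled)
    return [s - t for s, t in zip(src, tgt)]


def classify(vec: Vector, centroids: Dict[str, Vector], shift: Vector) -> str:
    """Shift a large-network feature vector, then pick the nearest centroid."""
    moved = [v + d for v, d in zip(vec, shift)]
    return min(
        centroids,
        key=lambda app: sum((a - b) ** 2 for a, b in zip(moved, centroids[app])),
    )


# Hypothetical signature (IP address -> application) and per-IP feature vectors.
signature = {"203.0.113.10": "video", "203.0.113.20": "web"}
second_features = {  # small-scale network
    "203.0.113.10": [8.0, 1.0],
    "203.0.113.20": [1.0, 8.0],
    "198.51.100.7": [4.0, 4.0],
}
first_features = {  # large-scale network, no labels available
    "192.0.2.1": [10.0, 3.0],
    "192.0.2.2": [3.0, 10.0],
    "192.0.2.3": [6.0, 6.0],
}

labeled = add_labels(second_features, signature)
centroids = pretrain(labeled)
shift = fit_shift(list(second_features.values()), list(first_features.values()))
result = {ip: classify(vec, centroids, shift) for ip, vec in first_features.items()}
```

In this toy data the large-network features are offset from the small-network features by a constant, which is the only kind of domain gap a mean alignment can remove; a practical system would likely need a stronger adaptation method.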

The identification method according to claim 1, further comprising an identification step of using the classifier to identify an application corresponding to an IP address of flow data to be identified, wherein
the collection step collects the flow data to be identified,
the calculation step calculates, for the flow data to be identified, identification feature information that is a statistical feature for each IP address, and
the identification step inputs the identification feature information to the classifier and identifies the application corresponding to the IP address of the flow data to be identified based on the identification result output from the classifier.

The identification method according to claim 1 or 2, wherein the calculation step calculates at least one of a histogram of the number of packets, a histogram of the number of bytes, or a histogram of the number of bytes and packets, from a set of flow data per 24 hours in which a given IP address is a source and/or a destination.
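The per-IP histogram calculation can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the `Flow` record fields, the bin edges, and the choice to concatenate the packet-count and byte-count histograms into one vector are all hypothetical. Each flow is counted once for its source address and once for its destination address within the 24-hour window containing its start time.

```python
from bisect import bisect_right
from collections import defaultdict
from typing import Dict, List, NamedTuple, Sequence, Tuple

DAY_SECONDS = 24 * 60 * 60


class Flow(NamedTuple):
    """Hypothetical flow record; field names are assumptions."""
    start: int    # flow start time in seconds since the epoch
    src: str      # source IP address
    dst: str      # destination IP address
    packets: int  # packets in the flow
    nbytes: int   # bytes in the flow


def histogram(values: Sequence[int], edges: Sequence[int]) -> List[int]:
    """Count values into len(edges) + 1 bins; bin i holds edges[i-1] <= v < edges[i]."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        counts[bisect_right(edges, v)] += 1
    return counts


def per_ip_features(
    flows: Sequence[Flow],
    packet_edges: Sequence[int],
    byte_edges: Sequence[int],
) -> Dict[Tuple[str, int], List[int]]:
    """Build one feature vector per (IP address, day index) pair."""
    grouped: Dict[Tuple[str, int], List[Flow]] = defaultdict(list)
    for f in flows:
        day = f.start // DAY_SECONDS
        # A flow contributes to both endpoints (once if src == dst).
        for ip in {f.src, f.dst}:
            grouped[(ip, day)].append(f)
    return {
        key: histogram([f.packets for f in fs], packet_edges)
        + histogram([f.nbytes for f in fs], byte_edges)
        for key, fs in grouped.items()
    }


flows = [
    Flow(0, "10.0.0.1", "203.0.113.10", 3, 400),
    Flow(60, "10.0.0.2", "203.0.113.10", 50, 90000),
    Flow(DAY_SECONDS + 5, "10.0.0.1", "203.0.113.10", 12, 2000),
]
features = per_ip_features(flows, packet_edges=[10, 100], byte_edges=[1000, 10000])
```

Here `features[("203.0.113.10", 0)]` holds three packet-count bins followed by three byte-count bins for that address on the first day. Fixed bin edges keep the vectors comparable between the small-scale and large-scale networks, which matters if the same classifier is to be applied to both.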

An identification device for identifying an application, comprising:
a collection unit that collects packet data from a small-scale network that satisfies a predetermined rule and first flow data of a large-scale network to be identified;
a signature generation unit that analyzes the packet data from the small-scale network and generates a signature that associates an application with an IP address;
a flow data generation unit that generates second flow data from the packet data from the small-scale network;
a feature calculation unit that calculates, for the first flow data of the large-scale network, first feature information that is a statistical feature for each IP address, and calculates, for the second flow data, second feature information that is a statistical feature for each IP address;
a label adding unit that adds a label to the second feature information using the signature; and
a learning unit that causes a classifier to learn identification of the application using the first feature information and the second feature information as learning data,
wherein the learning unit pre-trains the classifier using the labeled second feature information as learning data, and then trains the classifier by a domain adaptation technique using the unlabeled first feature information and the unlabeled second feature information.

An identification program that causes a computer to execute:
a collection step of collecting packet data from a small-scale network that satisfies a predetermined rule and first flow data of a large-scale network to be identified;
a signature generation step of analyzing the packet data from the small-scale network to generate a signature that associates an application with an IP address;
a flow data generation step of generating second flow data from the packet data from the small-scale network;
a calculation step of calculating, for the first flow data of the large-scale network, first feature information that is a statistical feature for each IP address, and calculating, for the second flow data, second feature information that is a statistical feature for each IP address;
an addition step of adding a label to the second feature information using the signature; and
a learning step of causing a classifier to learn identification of the application using the first feature information and the second feature information as learning data,
wherein, in the learning step, the classifier is pre-trained using the labeled second feature information as learning data, and is then trained by a domain adaptation technique using the unlabeled first feature information and the unlabeled second feature information.