JP6787861B2

JP6787861B2 - Classification device

Info

Publication number: JP6787861B2
Application number: JP2017180011A
Authority: JP
Inventors: 泰史西山; 充敏熊谷; 和憲神谷
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2017-09-20
Filing date: 2017-09-20
Publication date: 2020-11-18
Anticipated expiration: 2037-09-20
Also published as: JP2019057016A

Description

The present invention relates to a classification device.

In recent years, cyber attacks have become more sophisticated, and proactive measures such as antivirus software alone can no longer completely prevent malware infection. Techniques that analyze the communication logs of network devices, detect malware infection early, and block the communication are therefore becoming increasingly important.

Specifically, many security vendors offer a service called MSS (Managed Security Service), which monitors and analyzes communication logs and provides incident information to customers. MSS providers station specialized operators and analysts in an organization called a SOC (Security Operation Center) to monitor and analyze customer logs.

Manually analyzing every log in a customer's network is impractical from a cost standpoint. A classifier therefore first mechanically separates "communication logs suspected of malware infection" from "normal communication logs", and analysts examine only the suspected logs. Because the ability to detect new types of malware is the source of an MSS's competitiveness, it is important that the classifier reduce false positives among communication logs suspected of malware infection and not miss new types of malware.

Conventionally, operators and analysts create such classifiers manually from a variety of information sources and operate them by adding and updating signatures, mainly malware-related signatures and whitelist signatures. Because a signature must be added every time a new type of malware appears, this places a burden on operators and analysts.

Techniques that create classifiers by machine learning have therefore attracted attention. Many of the new types of malware produced in large numbers every day are not entirely new; they often reuse source code with only partial modifications or are variants created by repackaging. Their overall characteristics therefore differ little from those of known malware, and their communication patterns are often similar. Consequently, applying machine learning to communication logs and capturing communication features similar to those of known malware makes it possible to detect new types of malware (see Non-Patent Documents 1 to 4).

Non-Patent Document 1: Justin Ma et al., "Beyond Blacklists: Learning to Detect Malicious Web Sites from Suspicious URLs", KDD'09, 2009
Non-Patent Document 2: Sho Mizuno and 3 others, "Method of discriminating communications generated by malware-infected hosts", IEICE Technical Report, The Institute of Electronics, Information and Communication Engineers, 2016, ICSS2015-66, pp. 117-122
Non-Patent Document 3: Florian Tegeler et al., "BotFinder: Finding Bots in Network Traffic Without Deep Packet Inspection", The 8th ACM International Conference on emerging Networking Experiments and Technologies (CoNEXT 2012), Association for Computing Machinery, 2012, pp. 349-360
Non-Patent Document 4: Leyla Bilge et al., "Disclosure: Detecting Botnet Command and Control Servers Through Large-Scale NetFlow Analysis", The 28th Annual Computer Security Applications Conference (ACSAC'12), Association for Computing Machinery, 2012, pp. 129-138

However, with conventional techniques it has been difficult to create a highly accurate classifier that detects new types of malware. For example, classifiers created by unsupervised learning (see Non-Patent Document 3) generally suffer from low accuracy.

When a classifier is created by supervised learning (see Non-Patent Documents 1, 2, and 4), labeling the training data is difficult. Specifically, as malware evolves, the communication logs labeled with the correct answer that serve as training data must be updated periodically so that the classifier can be updated. To create a classifier by machine learning, experts such as SOC analysts therefore have to analyze the communication logs, distinguish benign logs (communication logs emitted by terminals performing normal communication) from malicious logs (communication logs emitted by malware-infected terminals), and assign labels manually. At present, however, because of labor costs and the analysis workload, the mainstream approach is to create classifiers by manually updating signatures without using machine learning.

Moreover, the detailed analysis needed for labeling usually relies on communication logs that contain detailed information, such as Proxy logs (see Non-Patent Documents 1 and 2). However, because such logs are large and require dedicated equipment to be installed, they are difficult to obtain from the diverse communication environments around the world in which malware operates. In contrast, communication logs that carry little information, such as xFlow (see Non-Patent Documents 3 and 4), can be obtained from diverse communication environments around the world, but because they carry so little information it is difficult to assign labels using them alone.

The present invention has been made in view of the above, and its object is to create a highly accurate classifier that detects new types of malware while reducing the labor of labeling.

To solve the above problems and achieve the object, a classification device according to the present invention comprises: an integration unit that converts the formats of benign communication logs, which are communication logs emitted by terminals performing normal communication, malicious communication logs, which are communication logs emitted by terminals infected with malware, and communication logs that are neither benign nor malicious into a format containing all items included in all the communication logs, and integrates all the communication logs; a creation unit that performs learning using all the integrated communication logs to create a classifier that classifies a communication log as either benign or malicious; and a classification unit that uses the created classifier to classify an unknown communication log as either benign or malicious.

According to the present invention, a highly accurate classifier that detects new types of malware can be created while reducing the labor of labeling.

FIG. 1 is a schematic diagram illustrating the schematic configuration of the classification device according to the present embodiment.
FIG. 2 is a diagram illustrating the data structure of the training data.
FIG. 3 is a diagram illustrating the data structure of the training data.
FIG. 4 is a diagram illustrating the integrated training data.
FIG. 5 is an explanatory diagram of the processing of the conversion unit.
FIG. 6 is an explanatory diagram of the processing of the conversion unit.
FIG. 7 is an explanatory diagram of the processing of the conversion unit.
FIG. 8 is a flowchart showing the creation procedure.
FIG. 9 is a flowchart showing the determination procedure.
FIG. 10 is an explanatory diagram of the effect of the classification process performed by the classification device.
FIG. 11 is a diagram showing an example of a computer that executes a classification program.

An embodiment of the present invention is described in detail below with reference to the drawings. The present invention is not limited to this embodiment. In the drawings, identical parts are denoted by identical reference numerals.

[Configuration of the classification device]
FIG. 1 is a schematic diagram illustrating the schematic configuration of the classification device. As illustrated in FIG. 1, the classification device 10 is implemented on a general-purpose computer such as a personal computer and includes an input unit 11, an output unit 12, a communication control unit 13, a storage unit 14, and a control unit 15.

The input unit 11 is implemented with input devices such as a keyboard and a mouse and, in response to input operations by an operator, inputs various kinds of instruction information, such as an instruction to start processing, to the control unit 15. The output unit 12 is implemented with a display device such as a liquid crystal display, a printing device such as a printer, or the like. The communication control unit 13 is implemented with a NIC (Network Interface Card) or the like and controls communication between the control unit 15 and external devices, such as network devices and management servers, over telecommunication lines such as a LAN (Local Area Network) or the Internet.

The storage unit 14 is implemented with a semiconductor memory element such as a RAM (Random Access Memory) or a flash memory, or with a storage device such as a hard disk or an optical disk, and stores, among other things, the classifier 14a created by the classification process described later. The storage unit 14 may also be configured to communicate with the control unit 15 via the communication control unit 13.

The control unit 15 is implemented with a CPU (Central Processing Unit) or the like and executes a processing program stored in memory. As illustrated in FIG. 1, the control unit 15 thereby functions as a learning data acquisition unit 15a, an integration unit 15b, a conversion unit 15c, a creation unit 15d, a test data acquisition unit 15e, a conversion unit 15f, and a classification unit 15g.

Each of these functional units, or some of them, may be implemented on different hardware. For example, the classification device 10 may be split into a creation device that implements the learning data acquisition unit 15a, the integration unit 15b, the conversion unit 15c, and the creation unit 15d, and a determination device that implements the test data acquisition unit 15e, the conversion unit 15f, and the classification unit 15g.

The learning data acquisition unit 15a acquires the training data used to train the classifier 14a, described later, from network devices such as Proxy servers, management servers, and the like. The training data includes benign behavior data, malicious behavior data, and undetermined data. Benign behavior data means benign communication logs, that is, communication logs emitted by terminals performing normal communication. Malicious behavior data means malicious communication logs, that is, communication logs emitted by terminals infected with malware. Undetermined data means communication logs that are neither benign behavior data nor malicious behavior data, that is, communication logs labeled neither benign nor malicious.

The methods for acquiring the benign behavior data, malicious behavior data, and undetermined data are not particularly limited. For example, benign behavior data can be acquired from terminals in a real network that are known not to be infected with malware. Malicious behavior data can be acquired by dynamic analysis, in which samples of known malware are run in a virtual environment, or by making use of a known blacklist.

For the undetermined data, logs from a real network in which benign and malicious behavior data may be mixed can be used. To make new types of malware detectable, the undetermined data should preferably include communication logs that may contain new types of malware or that resemble them. However, logs in which the two are not mixed may also be used.

Communication logs of various formats can be used, for example Proxy logs, xFlow, and Firewall logs. A Proxy log is a communication log acquired from a Proxy server and includes information such as the source IP address, HTTP method, URL, and UserAgent.

xFlow (NetFlow) is network flow information. As an industry standard for flow measurement, xFlow is supported by the network equipment of many vendors. xFlow includes the source IP address, destination IP address, source port number, destination port number, protocol, and so on. Because xFlow carries less information than Proxy logs and the like, it can be collected even from large networks on the scale of an ISP, but it does not permit detailed analysis.

A Firewall log is a communication log acquired from a firewall and includes information such as the source IP address, destination IP address, source port number, destination port number, protocol, date and time, and packet size.

All or some of the benign behavior data, malicious behavior data, and undetermined data may be communication logs of mutually different formats. To obtain a classifier that handles diverse environments while reducing the labor of labeling, it is desirable, for example, to use communication logs containing detailed information, such as Proxy logs, for the benign and malicious behavior data, and communication logs that are unlabeled but cover a wide area, such as xFlow, for the undetermined data. Alternatively, the benign behavior data, malicious behavior data, and undetermined data may all be communication logs of the same format that contain detailed information, such as Proxy logs. In the present embodiment, Proxy logs are used as the benign and malicious behavior data, and xFlow is used as the undetermined data.

FIGS. 2 and 3 illustrate the data structure of the training data. FIG. 2 shows benign or malicious behavior data based on Proxy logs: three items of benign behavior data, Log1 to Log3, and one item of malicious behavior data, Log4. In the training data of FIG. 2, each acquired Proxy log carries a label indicating benign or malicious. For example, the malicious behavior data Log4 carries the "malicious" label and has the source IP address "30.30.30.30", the HTTP method "GET", the URL "http://malware.co.jp/", and the HTTPUserAgent "<wellknown>".

FIG. 3 shows undetermined data based on xFlow. For example, the undetermined data LogA has the source IP address "20.20.20.20", the destination IP address "4.4.4.4", the destination port number "80", and the protocol "TCP".

The integration unit 15b converts the formats of the benign communication logs (communication logs emitted by terminals performing normal communication), the malicious communication logs (communication logs emitted by terminals infected with malware), and the communication logs that are neither benign nor malicious into a format that contains all items included in all the communication logs, and integrates all the communication logs. Specifically, the integration unit 15b joins the items contained in the benign behavior data, the malicious behavior data, and the undetermined data, thereby unifying the format of all the training data and integrating it.

FIG. 4 illustrates the integrated training data. As shown in FIG. 4, the integration unit 15b joins all items contained in the benign behavior data, the malicious behavior data, and the undetermined data. In the example of FIG. 4, items for which a record has no corresponding value are shown as "-". For example, LogA is undetermined data based on xFlow and has no values for the URL, HTTP method, HTTPUserAgent, HTTPStatusCode, and label, so each of these items is shown as "-".
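The integration step can be pictured with a minimal Python sketch. The field names (`src_ip`, `dst_ip`, `http_method`, and so on), the record contents, and the function `integrate_logs` are illustrative assumptions rather than anything specified in the patent; only the idea of taking the union of all items and marking missing values with "-" comes from the text.

```python
MISSING = "-"  # placeholder for items a log format does not carry, as in FIG. 4


def integrate_logs(*log_groups):
    """Merge logs of different formats into one format containing every item."""
    # Collect the union of all item names, preserving first-seen order.
    all_items = []
    for group in log_groups:
        for log in group:
            for item in log:
                if item not in all_items:
                    all_items.append(item)
    # Re-emit every log with the full item set, filling gaps with the placeholder.
    return [
        {item: log.get(item, MISSING) for item in all_items}
        for group in log_groups
        for log in group
    ]


# Hypothetical records loosely modeled on FIG. 2 (Proxy log) and FIG. 3 (xFlow).
proxy_logs = [
    {"id": "Log4", "src_ip": "30.30.30.30", "http_method": "GET",
     "url": "http://malware.co.jp/", "label": "malicious"},
]
xflow_logs = [
    {"id": "LogA", "src_ip": "20.20.20.20", "dst_ip": "4.4.4.4",
     "dst_port": 80, "protocol": "TCP"},
]

integrated = integrate_logs(proxy_logs, xflow_logs)
```

After integration, every record has the same set of items, so the xFlow record carries "-" for the URL and the label, and the Proxy record carries "-" for the destination IP address.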

Returning to FIG. 1, the conversion unit 15c prepares the integrated training data for processing by the creation unit 15d, described later, by extracting features from it and converting them into feature vectors. First, the conversion unit 15c extracts from the integrated training data the features, that is, the combination of aspects on which learning focuses. The feature extraction method is not particularly limited: features may be selected manually, or a technique that extracts features automatically as part of machine learning, such as deep learning, may be applied.

Here, machine learning means learning the patterns of the extracted features and creating a model that performs the intended classification. To classify logs as benign or malicious, the classification device 10 of the present embodiment extracts as features, for example, the host name of the URL, the destination port number, the path length, whether the domain name is an IP address, the CountryCode, and the communication time interval.
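As a rough illustration of a few of the URL-derived features, the following Python sketch uses only the standard library. The function name `extract_url_features` and the output keys are hypothetical, and the sketch covers only a subset of the features (CountryCode and communication time interval would need data that a URL alone does not provide).

```python
import ipaddress
from urllib.parse import urlsplit


def is_ip_address(host):
    """Return True if the host string is a literal IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def extract_url_features(url):
    """Extract a few illustrative features from a URL (names are assumptions)."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    return {
        "host": host,                        # host name of the URL
        "path_length": len(parts.path),      # length of the path component
        "host_is_ip": is_ip_address(host),   # whether the domain name is an IP address
        "digits_in_host": sum(ch.isdigit() for ch in host),
    }


features = extract_url_features("http://malware.co.jp/")
```

For the URL from FIG. 2 this yields the host `malware.co.jp`, a path length of 1, and `host_is_ip` set to `False`.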

Next, the conversion unit 15c converts the extracted features into a feature vector, using a technique such as Bag-of-Words or N-gram. In the present embodiment, the conversion unit 15c uses Bag-of-Words: every pattern that occurs for each feature is treated as one element, and whether each element appears in a communication log is expressed as 0 or 1.

FIGS. 5 to 7 are explanatory diagrams of the processing of the conversion unit 15c. FIG. 5 illustrates the features extracted from the training data of FIG. 4. In the example of FIG. 5, the extracted features include the source IP address, the destination IP address, the destination port number, the domain name, and the number of digits in the domain name.

FIG. 6 illustrates the elements of the feature vector converted from the features. In the example of FIG. 6, for the destination IP address among the features of FIG. 5, each of the six patterns that occur, "1.1.1.1" to "6.6.6.6", is treated as one element of the feature vector. An element is set to 1 if it appears in a given communication log and to 0 if it does not. Likewise, each of the three patterns that occur for the destination port number, "80", "2323", and "8080", is treated as one element, set to 1 if it appears in the communication log and to 0 otherwise. In this way, the conversion unit 15c sets features for which there is no corresponding data to 0. As a result, such features do not affect a classification that is performed by weighting features and their combinations.
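The 0/1 encoding described for FIG. 6 can be sketched in a few lines of Python. The helper names `build_vocabulary` and `to_feature_vector`, the field names, and the sample logs are assumptions made for illustration; the behavior follows the text: one vector element per observed (feature, value) pattern, 1 if the pattern appears in a log and 0 otherwise, with missing values contributing only zeros.

```python
MISSING = "-"  # marks an item that has no value in the integrated data


def build_vocabulary(logs, feature_names):
    """Map each (feature, value) pattern seen in the logs to a vector index."""
    vocab = {}
    for name in feature_names:
        for log in logs:
            value = log.get(name, MISSING)
            if value != MISSING and (name, value) not in vocab:
                vocab[(name, value)] = len(vocab)
    return vocab


def to_feature_vector(log, vocab):
    """Encode one log as a 0/1 vector over all known patterns."""
    vector = [0] * len(vocab)
    for (name, value), index in vocab.items():
        if log.get(name, MISSING) == value:
            vector[index] = 1
    return vector


# Hypothetical logs; the third one has no destination IP address.
logs = [
    {"dst_ip": "1.1.1.1", "dst_port": 80},
    {"dst_ip": "2.2.2.2", "dst_port": 8080},
    {"dst_ip": "-", "dst_port": 80},
]
vocab = build_vocabulary(logs, ["dst_ip", "dst_port"])
vectors = [to_feature_vector(log, vocab) for log in logs]
```

Here the vocabulary has four elements (two destination IP addresses, two ports), and the third log, which lacks a destination IP address, gets zeros in both IP-address positions.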

The conversion unit 15c also converts the label indicating whether training data is benign or malicious into a numeric label, for example 0 for benign and 1 for malicious. FIG. 7 illustrates the feature vectors converted from the features and the numeric labels converted from the labels. In FIG. 7, for example, in the feature vector of Log1 the element corresponding to the destination IP address "1.1.1.1" is set to 1, and the label of Log1 is 0, indicating benign. For the unlabeled undetermined data, the label is shown as "-" in FIG. 7.
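A small Python sketch of the label conversion follows. Representing an absent label as `None` is an assumption made here for convenience (FIG. 7 shows it as "-"); the mapping of benign to 0 and malicious to 1 is the example given in the text.

```python
LABEL_TO_NUMBER = {"benign": 0, "malicious": 1}
UNLABELED = None  # assumption: stands in for the "-" shown in FIG. 7


def to_numeric_label(label):
    """Convert a textual label to 0/1, or UNLABELED for undetermined data."""
    return LABEL_TO_NUMBER.get(label, UNLABELED)


numeric_labels = [to_numeric_label(label) for label in ["benign", "malicious", "-"]]
```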

Returning to FIG. 1, the creation unit 15d performs learning using all the integrated communication logs to create the classifier 14a, which classifies a communication log as either benign or malicious. Specifically, the creation unit 15d uses communication logs labeled as either benign or malicious together with unlabeled communication logs to learn how the classifier 14a should assign labels.

That is, the creation unit 15d performs semi-supervised learning using the feature vectors and numeric labels produced by the conversion unit 15c and creates, as the classifier 14a, a model that indicates the degree to which a communication log is benign or malicious. The creation unit 15d then stores the created classifier 14a in the storage unit 14.

The semi-supervised learning algorithm is not particularly limited. For example, TSVM (Transductive Support Vector Machine), semi-supervised Logistic Regression, Label Propagation, semi-supervised GMM (Gaussian Mixture Model), or Self-training can be applied.
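To make the idea of semi-supervised learning concrete, here is a minimal self-training sketch in pure Python. Self-training is one of the algorithms the text lists, but the nearest-centroid base model, the scoring formula, the confidence value, and all names below are assumptions chosen to keep the example short; the patent does not prescribe them.

```python
def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def malicious_score(vector, benign_center, malicious_center):
    """Score in [0, 1]; higher means closer to the malicious centroid."""
    d_benign = squared_distance(vector, benign_center)
    d_malicious = squared_distance(vector, malicious_center)
    total = d_benign + d_malicious
    return 0.5 if total == 0 else d_benign / total


def fit_centers(vectors, labels):
    """Centroids of the currently labeled data (needs one example per class)."""
    benign = [v for v, y in zip(vectors, labels) if y == 0]
    malicious = [v for v, y in zip(vectors, labels) if y == 1]
    return centroid(benign), centroid(malicious)


def self_train(vectors, labels, confidence=0.8, max_rounds=10):
    """Repeatedly pseudo-label confident unlabeled vectors, then refit."""
    labels = list(labels)
    centers = fit_centers(vectors, labels)
    for _ in range(max_rounds):
        newly_labeled = {}
        for i, (vector, label) in enumerate(zip(vectors, labels)):
            if label is not None:
                continue
            score = malicious_score(vector, *centers)
            if score >= confidence:
                newly_labeled[i] = 1
            elif score <= 1 - confidence:
                newly_labeled[i] = 0
        if not newly_labeled:
            break  # nothing confident enough is left to label
        for i, label in newly_labeled.items():
            labels[i] = label
        centers = fit_centers(vectors, labels)
    return centers, labels


# Hypothetical 0/1 feature vectors; None marks undetermined (unlabeled) data.
X = [[1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [0, 1, 1, 0]]
y = [0, 1, None, None, None]
centers, final_labels = self_train(X, y)
```

In this toy run the two unlabeled vectors that coincide with a labeled example receive pseudo-labels, while the ambiguous last vector, which is equidistant from both centroids, stays unlabeled.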

Like the learning data acquisition unit 15a, the test data acquisition unit 15e acquires, from network devices such as Proxy servers, management servers, and the like, the test data to be processed by the classification unit 15g, described later. The test data consists of the communication logs to be judged benign or malicious. These logs may have the same format as the benign behavior data, malicious behavior data, and undetermined data, or a different format. The very same communication logs used as undetermined data can also be used as test data. The test data acquisition unit 15e may be the same functional unit as the learning data acquisition unit 15a.

Like the conversion unit 15c described above, the conversion unit 15f prepares the test data for processing by the classification unit 15g, described later, by extracting its features and converting them into feature vectors. The conversion unit 15f may be the same functional unit as the conversion unit 15c.

The classification unit 15g uses the created classifier 14a to classify an unknown communication log as either benign or malicious. Specifically, the classification unit 15g inputs the feature vector produced by the conversion unit 15f into the classifier 14a and determines the log to be benign or malicious when the score output by the classifier 14a, which indicates the degree to which the communication log is benign or malicious, is higher than a predetermined threshold.
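The threshold decision can be sketched as follows. The text leaves open whether the score expresses benignness or maliciousness, so this sketch assumes a score in [0, 1] where higher means more malicious; the threshold value 0.5 and the function name `classify` are likewise assumptions.

```python
def classify(score, threshold=0.5):
    """Return "malicious" if the score exceeds the threshold, else "benign".

    Assumes the score expresses the degree of maliciousness; both the
    threshold value and the score range are illustrative choices.
    """
    return "malicious" if score > threshold else "benign"


# Hypothetical scores output by a classifier for three unknown logs.
results = [classify(score) for score in [0.9, 0.1, 0.5]]
```

Raising the threshold trades missed detections for fewer false positives, which matches the concern about false positives raised in the background section.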

[Classification process]
Next, the classification process performed by the classification device 10 according to the present embodiment is described with reference to FIGS. 8 and 9. The classification process includes a creation process and a determination process. FIG. 8 is a flowchart showing the creation procedure. The flowchart of FIG. 8 starts, for example, when an operation input instructing the start of the creation process is received.

First, the learning data acquisition unit 15a receives input of learning data via the input unit 11 or the communication control unit 13 (step S1). Next, the integration unit 15b merges the items of the input learning data, namely the benign behavior data, the malicious behavior data, and the undetermined data, thereby unifying the format of all the data and integrating the data (step S2).
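Step S2 can be sketched as follows, assuming that each communication log is represented as a dictionary of items and that items missing from a given log are filled with `None`. The representation, the fill value, the function name `integrate_logs`, and the field names are assumptions made for illustration only.

```python
from typing import Any, Dict, List, Tuple

Record = Dict[str, Any]


def integrate_logs(*log_groups: List[Record]) -> Tuple[List[str], List[Record]]:
    """Convert all records to one format containing every item found in any record."""
    # Collect the union of item names, in first-seen order.
    fields: List[str] = []
    for group in log_groups:
        for record in group:
            for key in record:
                if key not in fields:
                    fields.append(key)
    # Rewrite every record in the unified format; absent items become None (assumed).
    unified = [
        {field: record.get(field) for field in fields}
        for group in log_groups
        for record in group
    ]
    return fields, unified


if __name__ == "__main__":
    benign = [{"src_ip": "10.0.0.1", "url": "http://example.com/"}]
    malicious = [{"src_ip": "10.0.0.2", "dst_port": 4444}]
    undetermined = [{"user_agent": "curl/8.0"}]
    fields, unified = integrate_logs(benign, malicious, undetermined)
    print(fields)
    for row in unified:
        print(row)
```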

Next, the conversion unit 15c extracts feature quantities from the learning data whose format has been unified (step S3). The conversion unit 15c then converts the extracted feature quantities into feature vectors (step S4).
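One possible shape of steps S3 and S4 is sketched below. It assumes that numeric items are used as they are, that categorical items are one-hot encoded, and that missing values become 0; this passage of the embodiment does not fix any of these choices. The names `fit_vocabulary` and `to_feature_vector` and the items `bytes` and `method` are hypothetical.

```python
from typing import Any, Dict, List

Record = Dict[str, Any]


def fit_vocabulary(records: List[Record], categorical_fields: List[str]) -> Dict[str, List[Any]]:
    """Collect the distinct values of each categorical item (sorted for a stable order)."""
    return {
        field: sorted({r[field] for r in records if r.get(field) is not None})
        for field in categorical_fields
    }


def to_feature_vector(
    record: Record, numeric_fields: List[str], vocab: Dict[str, List[Any]]
) -> List[float]:
    """Convert one unified record into a fixed-length numeric feature vector."""
    vector: List[float] = []
    for field in numeric_fields:
        value = record.get(field)
        vector.append(float(value) if value is not None else 0.0)  # missing -> 0 (assumed)
    for field, values in vocab.items():
        # One-hot encoding: 1.0 at the position of the record's value, else 0.0.
        vector.extend(1.0 if record.get(field) == v else 0.0 for v in values)
    return vector


if __name__ == "__main__":
    records = [
        {"bytes": 120, "method": "GET"},
        {"bytes": None, "method": "POST"},
        {"bytes": 40, "method": None},
    ]
    vocab = fit_vocabulary(records, ["method"])
    for r in records:
        print(to_feature_vector(r, ["bytes"], vocab))
```

Because every record is mapped through the same item list and vocabulary, logs that originally had different formats end up as vectors of equal length.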

The creation unit 15d then performs learning using the feature vectors converted by the conversion unit 15c and creates, as the classifier 14a, a model that indicates the degree to which a communication log is benign or malicious (step S5). This completes the creation process.
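Learning from labeled and unlabeled feature vectors together can be illustrated with a minimal sketch. Self-training on a nearest-centroid model is used here purely as a stand-in technique and is not presented as the learning method of the embodiment. The numerical labels 0 (benign) and 1 (malicious), the confidence level 0.8, and all function names are assumptions.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def centroid(vectors: List[Vector]) -> List[float]:
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def score(vector: Vector, benign_c: Vector, malicious_c: Vector) -> float:
    """Maliciousness score in [0, 1]; higher means closer to the malicious centroid."""
    d_b = math.dist(vector, benign_c)
    d_m = math.dist(vector, malicious_c)
    return 0.5 if d_b + d_m == 0 else d_b / (d_b + d_m)


def train(
    labeled: List[Tuple[Vector, int]],
    unlabeled: List[Vector],
    confidence: float = 0.8,
    max_rounds: int = 10,
) -> Tuple[List[float], List[float]]:
    """Self-training: confidently scored unlabeled vectors receive pseudo-labels."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(max_rounds):
        benign_c = centroid([v for v, y in labeled if y == 0])
        malicious_c = centroid([v for v, y in labeled if y == 1])
        remaining = []
        for v in pool:
            s = score(v, benign_c, malicious_c)
            if s >= confidence:
                labeled.append((v, 1))
            elif s <= 1 - confidence:
                labeled.append((v, 0))
            else:
                remaining.append(v)
        if len(remaining) == len(pool):  # nothing new was pseudo-labeled
            break
        pool = remaining
    # Recompute so the returned centroids reflect every pseudo-label added.
    benign_c = centroid([v for v, y in labeled if y == 0])
    malicious_c = centroid([v for v, y in labeled if y == 1])
    return benign_c, malicious_c


if __name__ == "__main__":
    labeled = [([0.0, 0.0], 0), ([10.0, 10.0], 1)]
    unlabeled = [[1.0, 1.0], [9.0, 9.0], [5.0, 5.0]]
    print(train(labeled, unlabeled))
```

In this sketch the returned pair of centroids plays the role of the classifier 14a, and `score` yields the value that is later compared with the threshold.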

FIG. 9 is a flowchart showing the determination procedure. The flowchart of FIG. 9 starts, for example, when an operation input instructing the start of the determination process is received.

First, the test data acquisition unit 15e receives input of the test data to be processed via the input unit 11 or the communication control unit 13 (step S11). Next, the conversion unit 15f extracts feature quantities from the test data (step S12). The conversion unit 15f then converts the extracted feature quantities into feature vectors (step S13).

Next, the classification unit 15g inputs the feature vector converted by the conversion unit 15f into the classifier 14a (step S14). The classifier 14a calculates and outputs a score indicating the degree to which the communication log is benign or malicious (step S15). Then, when the score output by the classifier 14a is higher than the predetermined threshold, the classification unit 15g judges the communication log to be benign or malicious (step S16). This completes the determination process.

As described above, in the classification device 10 of the present embodiment, the integration unit 15b converts the formats of benign communication logs, which are indicated to be communication logs issued by terminals performing normal communication, malicious communication logs, which are indicated to be communication logs issued by terminals infected with malware, and communication logs that are neither benign nor malicious into a format that includes all items contained in all of the communication logs, and thereby integrates all of the communication logs. The creation unit 15d performs learning using the integrated benign communication logs, malicious communication logs, and communication logs that are neither, and creates the classifier 14a, which classifies a communication log as either benign or malicious. The classification unit 15g uses the created classifier 14a to classify an unknown communication log as either benign or malicious.

This allows the classification device 10 to create the classifier 14a using undetermined data as learning data in addition to known benign behavior data and malicious behavior data. Conventionally, creating a highly accurate classifier required preparing a large number of labeled communication logs to update the classifier. In contrast, the classification device 10 of the present embodiment can learn simultaneously from a small number of labeled communication logs (benign behavior data and malicious behavior data) and a large number of unlabeled communication logs (undetermined data) as the communication logs used to update the classifier. Learning data can therefore be prepared easily, with less labeling effort, and a highly accurate classifier 14a can be created.

FIG. 10 is an explanatory diagram illustrating the effect of the classification process performed by the classification device 10. As shown in FIG. 10(a), when only benign behavior data and malicious behavior data are used as learning data and no undetermined data is used, it is difficult to estimate the threshold that forms the boundary between benign and malicious in regions where the benign behavior data or the malicious behavior data is sparse. It is therefore difficult to judge whether the test data to be classified is benign or malicious.

In contrast, the classification device 10 of the present embodiment uses undetermined data as learning data in addition to the benign behavior data and the malicious behavior data. As shown in FIG. 10(b), information on characteristics such as the distribution of the data can then be obtained from the undetermined data in regions where the benign behavior data or the malicious behavior data is sparse. The accuracy of classification can therefore be improved.

The test data to be classified may also be used as learning data. In this case, as shown in FIG. 10(c), treating the test data as the undetermined data shown in FIG. 10(b) makes it possible to obtain information on characteristics such as the distribution of the data from the test data in regions where the benign behavior data or the malicious behavior data is sparse.

The classification device 10 of the present embodiment can also learn from communication logs of different formats as learning data, regardless of the format of the communication logs. The installation of network devices differs from one network environment to another, and the formats of the communication logs that can be acquired from a real network vary. Because the information contained in communication logs differs by format, finding traces of a cyber attack requires analyzing communication logs of multiple formats from multiple angles and in correlation with one another. In conventional machine learning, a classifier could not be created unless the learning data shared the same format. In contrast, the classification device 10 of the present embodiment can create a classifier using communication logs of different formats.

In this way, the classification device 10 can obtain, from unlabeled communication logs, information related to new types of malware that could not be obtained from labeled communication logs alone. It can therefore perform classification that handles new types of malware.

In addition, because unlabeled communication logs can be used as newly added learning data, experts such as SOC analysts are spared the effort of analyzing the logs and assigning labels. The load involved in updating the classifier 14a can therefore be reduced.

Accordingly, the classification device 10 of the present embodiment reduces the burden of labeling learning data, uses communication logs of different formats as learning data, and creates a highly accurate classifier that can detect new types of malware.

[Program]
A program in which the processing executed by the classification device 10 according to the above embodiment is written in a computer-executable language can also be created. In one embodiment, the classification device 10 can be implemented by installing, on a desired computer, a classification program that executes the above classification process as packaged software or online software. For example, causing an information processing device to execute the classification program allows the information processing device to function as the classification device 10. The information processing device referred to here includes desktop and notebook personal computers. The category also includes mobile communication terminals such as smartphones, mobile phones, and PHS (Personal Handyphone System) terminals, as well as slate terminals such as PDAs (Personal Digital Assistants).

The classification device 10 can also be implemented as a server device that treats a terminal device used by a user as a client and provides the client with a service related to the above classification process. For example, the classification device 10 is implemented as a server device that provides a classification service which takes learning data and unknown data as input and outputs a benign/malicious judgment result for the unknown data. In this case, the classification device 10 may be implemented as a Web server, or as a cloud that provides the service related to the above classification process through outsourcing. An example of a computer that executes a classification program realizing the same functions as the classification device 10 is described below.

FIG. 11 is a diagram showing an example of a computer that executes the classification program. The computer 1000 has, for example, a memory 1010, a CPU 1020, a hard disk drive interface 1030, a disk drive interface 1040, a serial port interface 1050, a video adapter 1060, and a network interface 1070. These units are connected by a bus 1080.

The memory 1010 includes a ROM (Read Only Memory) 1011 and a RAM 1012. The ROM 1011 stores, for example, a boot program such as a BIOS (Basic Input Output System). The hard disk drive interface 1030 is connected to a hard disk drive 1031. The disk drive interface 1040 is connected to a disk drive 1041. A removable storage medium such as a magnetic disk or an optical disk is inserted into the disk drive 1041. A mouse 1051 and a keyboard 1052, for example, are connected to the serial port interface 1050. A display 1061, for example, is connected to the video adapter 1060.

The hard disk drive 1031 stores, for example, an OS 1091, an application program 1092, a program module 1093, and program data 1094. The various information tables described in the above embodiment, such as the classifier 14a, are stored in, for example, the hard disk drive 1031 or the memory 1010.

The classification program is stored in the hard disk drive 1031 as, for example, the program module 1093, in which instructions to be executed by the computer 1000 are written. Specifically, the program module 1093, in which each process executed by the classification device 10 described in the above embodiment is written, is stored in the hard disk drive 1031.

Data used for information processing by the classification program is stored as the program data 1094 in, for example, the hard disk drive 1031. The CPU 1020 reads the program module 1093 and the program data 1094 stored in the hard disk drive 1031 into the RAM 1012 as needed and executes each of the procedures described above.

The program module 1093 and the program data 1094 of the classification program need not be stored in the hard disk drive 1031. They may, for example, be stored in a removable storage medium and read by the CPU 1020 via the disk drive 1041 or the like. Alternatively, the program module 1093 and the program data 1094 of the classification program may be stored in another computer connected via a network such as a LAN or a WAN (Wide Area Network) and read by the CPU 1020 via the network interface 1070.

An embodiment to which the invention made by the present inventors is applied has been described above, but the present invention is not limited by the description and drawings that form part of the disclosure of the present invention according to this embodiment. That is, all other embodiments, examples, operational techniques, and the like made by those skilled in the art on the basis of this embodiment fall within the scope of the present invention.

10 Classification device
11 Input unit
12 Output unit
13 Communication control unit
14 Storage unit
14a Classifier
15 Control unit
15a Learning data acquisition unit
15b Integration unit
15c Conversion unit
15d Creation unit
15e Test data acquisition unit
15f Conversion unit
15g Classification unit

Claims

A classification device comprising:
an integration unit that integrates all communication logs by converting the formats of benign communication logs, each indicated to be a communication log issued by a terminal performing normal communication, malicious communication logs, each indicated to be a communication log issued by a terminal infected with malware, and communication logs that are neither benign nor malicious into a format that includes all items contained in all of the communication logs;
a conversion unit that converts feature quantities of all the integrated communication logs into feature vectors and converts the label indicating benign or the label indicating malicious attached to a communication log into a numerical label;
a creation unit that performs learning using the feature vectors, the communication logs to which the numerical label is attached, and the communication logs to which no numerical label is attached, and creates a classifier that assigns a label classifying a communication log as either benign or malicious; and
a classification unit that uses the created classifier to classify an unknown communication log, or a communication log that is neither benign nor malicious and was used by the creation unit, as either benign or malicious.

The classification device according to claim 1, wherein the classification unit judges the communication log to be benign or malicious when a score output by the classifier, indicating the degree to which the communication log is benign or malicious, is higher than a predetermined threshold.

The classification device according to claim 1, wherein all or some of the benign communication logs, the malicious communication logs, and the communication logs that are neither benign nor malicious integrated by the integration unit are communication logs of mutually different formats.

The classification device according to claim 1, wherein the classification unit classifies each terminal that emits a communication log into either a normal terminal that performs normal communication or an infected terminal that is infected with malware.