JP2016206950A

JP2016206950A - Perusal training data output device for malware determination, malware determination system, malware determination method, and perusal training data output program for malware determination

Info

Publication number: JP2016206950A
Application number: JP2015087924A
Authority: JP
Inventors: Yasushi Okano; Mitsutoshi Kumagai; Yoshito Oshima
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2015-04-22
Filing date: 2015-04-22
Publication date: 2016-12-08

Abstract

PROBLEM TO BE SOLVED: To perform malware determination using machine learning efficiently and with few false detections.

SOLUTION: Feature vectors are extracted from training data 100, which includes existing malware data 101 and existing good-ware data 102, and from misdetected good-ware data 110. Existing malware data 101 whose feature vectors are similar to those of the misdetected good-ware data 110 is removed from the training data, and the misdetected good-ware data 110 is inserted into the training data to produce perusal training data 120. The misdetected good-ware data 110 is inserted at a position that lowers the rate at which a classifier that has learned the perusal training data 120 misdetects the misdetected good-ware data 110.

SELECTED DRAWING: Figure 1

Description

The present invention relates to a scrutinizing teacher data output device for malware determination, a malware determination system, a malware determination method, and a scrutinizing teacher data output program for malware determination.

In antivirus systems that determine whether a Windows (registered trademark) or UNIX (registered trademark) executable file is malware, two methods are used: dynamic determination, which makes the determination by executing the file, and static determination, which makes the determination without executing it. Static determination is used when determination speed is particularly important.

Representative static determination methods include hash value match determination and pattern match determination (signature scanning). In hash value match determination, hash values (MD5, SHA1, SHA256, and so on) of known malware are held in a database in advance, and a file under inspection is judged to be malware if its hash value matches an entry in that database. In pattern match determination, specific character strings and byte codes contained in known malware are held in a database in advance, and a file under inspection is judged to be malware if it contains any of the registered character strings or byte codes.
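The two static checks described above can be sketched as follows. This is a minimal illustration only: the database contents, the byte patterns, and the function names `hash_match` and `pattern_match` are hypothetical and do not come from the patent.

```python
import hashlib

# Hypothetical databases built from one known malware sample (illustration only).
KNOWN_MALWARE_SHA256 = {hashlib.sha256(b"EVIL-PAYLOAD-v1").hexdigest()}
KNOWN_MALWARE_PATTERNS = [b"EVIL-PAYLOAD", b"\xde\xad\xbe\xef"]

def hash_match(data: bytes) -> bool:
    """Hash value match: malware if the file's SHA-256 is in the database."""
    return hashlib.sha256(data).hexdigest() in KNOWN_MALWARE_SHA256

def pattern_match(data: bytes) -> bool:
    """Pattern match: malware if the file contains any registered byte string."""
    return any(pattern in data for pattern in KNOWN_MALWARE_PATTERNS)

original = b"EVIL-PAYLOAD-v1"
variant = b"EVIL-PAYLOAD-v2"   # a one-byte modification of the original
print(hash_match(original), hash_match(variant), pattern_match(variant))
```

In this toy case the modified file no longer matches by hash, and it is caught by pattern matching only because the registered byte string happens to survive the modification.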

However, hash value match determination and pattern match determination have difficulty detecting variant malware, which is produced by modifying existing malware, and new types of malware. Heuristic determination has therefore been proposed as a technique for judging variant and new malware. It defines malware-likeness based on past experience and determines whether a file is malware according to that definition.

Several heuristic determination methods that use machine learning have been proposed. For example, a method has been proposed that learns in advance the readable character strings contained in executable files and judges the malware-likeness of an inspected file by how many words commonly used in malware it contains (see, for example, Patent Document 1). In such methods, the teacher data, that is, the data to be learned, is converted into sets of parameters and then learned by a machine learning algorithm. Each individual parameter is called a feature, and a set of parameters is called a feature vector. For example, pairs of a word and its number of occurrences form a feature vector. Various methods have also been proposed to improve the classification accuracy (F-measure and the like) and the learning speed of heuristic determination based on machine learning.
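As a rough illustration of the word-and-count feature vector mentioned above, the following sketch pulls printable strings out of a byte sequence and counts them. The minimum run length of 4, the sample bytes, and the vocabulary are assumptions made for the example; the cited method may extract and weight words differently.

```python
import re
from collections import Counter

# Runs of 4 or more printable ASCII bytes (an assumed definition of "readable string").
PRINTABLE_RUN = re.compile(rb"[\x20-\x7e]{4,}")

def string_counts(data: bytes) -> Counter:
    """Map each readable string found in the file to its number of occurrences."""
    return Counter(m.decode("ascii") for m in PRINTABLE_RUN.findall(data))

def to_feature_vector(counts: Counter, vocabulary: list) -> list:
    """Fix the feature order with a vocabulary; words that do not occur count as 0."""
    return [counts[word] for word in vocabulary]

sample = b"\x00\x01CreateFileA\x00\x02ab\x00CreateFileA\x00WriteFile\xff"
vocabulary = ["CreateFileA", "WriteFile", "connect"]  # hypothetical vocabulary
vector = to_feature_vector(string_counts(sample), vocabulary)
print(vector)
```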

For example, one technique for machine-learning-based malware determination using the PE header information of executable files achieves a detection rate of 99% and a false detection rate of 0.5% or less by using suitable dimension compression and a suitable machine learning algorithm (see, for example, Non-Patent Document 1).

In machine learning, case selection techniques have also been proposed that improve accuracy by scrutinizing the teacher data, for example by removing from it data that is thought to degrade classification accuracy. For example, for image classification using an SVM (Support Vector Machine), a case selection technique has been proposed that uses the SVM's internal parameters α_i to extract ambiguous image data that is hard to classify and removes that data from the teacher data (see, for example, Non-Patent Document 2). For image classification using the sparse representation classification method (SRC method), a technique has been proposed that calculates a contribution rate during learning from a score representing the certainty of the classification and uses as teacher data only the cases with a high contribution rate, that is, the more certain, typical cases, thereby shortening the learning time while keeping classification accuracy high (see, for example, Patent Document 2). In addition, a method of checking detection results against a whitelist is generally known.

Patent Document 1: JP 2012-27710 A
Patent Document 2: JP 2012-173795 A

Non-Patent Document 1: Shafiq, et al., "PE-Miner: Mining Structural Information to Detect Malicious Executables in Realtime", RAID '09, 2009.
Non-Patent Document 2: Takatori et al., "Proposal of a case selection method based on internal parameters of support vector machines and its application to the video boundary detection problem", 22nd Annual Conference of the Japanese Society for Artificial Intelligence, 2008.

However, the conventional techniques described above cannot be said to reduce false detections in malware determination sufficiently. In addition, techniques that use a whitelist require extra processing time for whitelist matching and extra storage for the whitelist.

In general, malware determination demands a particularly low false detection rate because the number of goodware files subject to determination is large. For example, the system directory of Windows 7, an OS made by Microsoft, contains more than 2,000 executable files. Even at a false detection rate of 0.5%, 10 or more goodware files would be falsely detected as malware (2,000 × 0.005 = 10), so such a false detection rate cannot be called sufficiently low.

A method is also known in which a whitelist is created in advance and consulted at determination time, and false detections are corrected by judging files that match the whitelist to be goodware. As noted above, however, this requires a storage area for the whitelist and processing time for matching, which lowers efficiency.

An object of the present invention is therefore to perform malware determination that uses machine learning, is efficient, and produces few false detections.

The scrutinizing teacher data output device for malware determination of the present invention includes: a feature extraction unit that extracts, as feature vectors, the features of existing malware data, which are files known to be malicious, of existing goodware data, which are files known to be benign, and of erroneously detected goodware data, which are files known to be benign but determined to be malicious; a similar malware removal unit that deletes, from teacher data including the existing malware data and the existing goodware data, existing malware data whose feature vectors are similar to the feature vectors of the erroneously detected goodware data; and an erroneously detected goodware insertion unit that inserts the erroneously detected goodware data into the teacher data as existing goodware data.

The malware determination system of the present invention includes a scrutinizing teacher data output device and a learning determination device. The scrutinizing teacher data output device includes: a first feature extraction unit that extracts, as feature vectors, the features of existing malware data, which are files known to be malicious, of existing goodware data, which are files known to be benign, and of erroneously detected goodware data, which are files known to be benign but determined to be malicious; a similar malware removal unit that deletes, from teacher data including the existing malware data and the existing goodware data, existing malware data whose feature vectors are similar to the feature vectors of the erroneously detected goodware data; and an erroneously detected goodware insertion unit that inserts the erroneously detected goodware data into the teacher data as existing goodware data. The learning determination device includes: a second feature extraction unit that extracts, as feature vectors, the features of the teacher data into which the erroneously detected goodware data has been inserted by the erroneously detected goodware insertion unit and of a target file that is to be judged as malware or not; a first classifier that learns the feature vectors of the teacher data and calculates, from the feature vector of the target file, a score indicating the malware-likeness of the target file; and a determination unit that determines, based on the score, whether the target file is malware.

The malware determination method of the present invention is executed by a malware determination system and includes: a first feature extraction step of extracting, as feature vectors, the features of existing malware data, which are files known to be malicious, of existing goodware data, which are files known to be benign, and of erroneously detected goodware data, which are files known to be benign but determined to be malicious; a similar malware removal step of deleting, from teacher data including the existing malware data and the existing goodware data, existing malware data whose feature vectors are similar to the feature vectors of the erroneously detected goodware data; an erroneously detected goodware insertion step of inserting the erroneously detected goodware data into the teacher data as existing goodware data; a second feature extraction step of extracting, as feature vectors, the features of the teacher data into which the erroneously detected goodware data has been inserted in the erroneously detected goodware insertion step and of a target file that is to be judged as malware or not; a classification step of learning the feature vectors of the teacher data and calculating, from the feature vector of the target file, a score indicating the malware-likeness of the target file; and a determination step of determining, based on the score, whether the target file is malware.

According to the present invention, malware determination that uses machine learning, is efficient, and produces few false detections can be performed.

FIG. 1 is a diagram illustrating an example of the configuration of the malware determination system according to the first embodiment.
FIG. 2 is a flowchart illustrating an example of case selection processing in the malware determination system according to the first embodiment.
FIG. 3 is a flowchart illustrating an example of learning processing in the malware determination system according to the first embodiment.
FIG. 4 is a flowchart illustrating an example of determination processing in the malware determination system according to the first embodiment.
FIG. 5 is a diagram illustrating an example of the configuration of the malware determination system according to the second embodiment.
FIG. 6 is a diagram illustrating an example of the configuration of the erroneously detected goodware data output device according to the second embodiment.
FIG. 7 is a flowchart illustrating an example of processing in the malware determination system according to the second embodiment.
FIG. 8 is a diagram illustrating an example of the configuration of a malware determination device according to another embodiment.
FIG. 9 is a diagram illustrating an example of a computer that implements the malware determination device or the scrutinizing teacher data output device by executing a program.

The scrutinizing teacher data output device for malware determination, the malware determination system, the malware determination method, and the scrutinizing teacher data output program for malware determination according to the present application are described in detail below with reference to the drawings. These embodiments do not limit the device, system, method, or program according to the present application.

[Configuration of First Embodiment]
First, the configuration of the first embodiment is described with reference to FIG. 1. FIG. 1 is a diagram illustrating an example of the configuration of the malware determination system according to the first embodiment. As shown in FIG. 1, the malware determination system 1 includes a scrutinizing teacher data output device 10 and a learning/determination device 20. The scrutinizing teacher data output device 10 performs case selection using teacher data 100, which consists of existing malware data 101 and existing goodware data 102, together with erroneously detected goodware data 110, and outputs scrutinizing teacher data 120. The learning/determination device 20 learns the scrutinizing teacher data 120, judges a target file 130, and outputs a determination result 140.

The scrutinizing teacher data output device 10 includes a feature extraction unit 11, a dimension reduction unit 12, a similar malware removal unit 13, and an erroneously detected goodware insertion unit 14. The existing malware data 101 and the existing goodware data 102 of the teacher data 100 contain executable files of existing malware and goodware, together with information indicating whether each executable file is classified as malware or goodware. The erroneously detected goodware data 110 consists of executable files of multiple goodware programs that were falsely detected. The erroneously detected goodware data 110 may be executable files that are not included in the teacher data 100; for example, it can include goodware that users have reported as falsely detected. The processing performed by the scrutinizing teacher data output device 10 can also be called case selection, and the device itself can also be called a case selection device or the like.

The feature extraction unit 11 extracts, as feature vectors, the features of the existing malware data 101 (files known to be malicious), the existing goodware data 102 (files known to be benign), and the erroneously detected goodware data 110 (files known to be benign but determined to be malicious). Before converting to feature vectors, the feature extraction unit 11 may select which features to keep in order to improve accuracy. The dimension reduction unit 12 compresses the dimensions of the feature vectors using principal component analysis or the like. By using principal component analysis, the dimension reduction unit 12 can automatically merge correlated features into a single feature.

The similar malware removal unit 13 deletes, from the teacher data 100 including the existing malware data 101 and the existing goodware data 102, the existing malware data 101 whose feature vectors are similar to those of the erroneously detected goodware data 110. Specifically, the similar malware removal unit 13 first calculates a similarity between the feature vector of each executable file in the existing malware data 101 and the feature vector of each executable file in the erroneously detected goodware data 110. It then applies a condition to the calculated similarity to decide whether the files are similar, and removes the executable files judged similar from the existing malware data 101, that is, from the teacher data 100.
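The removal step can be sketched as below, under the assumption that each teacher entry is a `(feature_vector, label)` pair and that similarity is judged by a Euclidean distance threshold. The data layout, the label strings, and the name `remove_similar_malware` are illustrative choices, not details given in the patent.

```python
import math

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def remove_similar_malware(teacher, fp_vectors, threshold):
    """Drop malware entries that lie within `threshold` of any falsely detected goodware.

    teacher: list of (feature_vector, label), label being "malware" or "goodware".
    fp_vectors: feature vectors of the erroneously detected goodware.
    Goodware entries are always kept, however close they are.
    """
    kept = []
    for vec, label in teacher:
        is_similar = label == "malware" and any(
            euclidean_distance(vec, fp) <= threshold for fp in fp_vectors
        )
        if not is_similar:
            kept.append((vec, label))
    return kept

teacher = [
    ([0.0, 0.0], "malware"),   # close to the false positive: removed
    ([5.0, 5.0], "malware"),   # far away: kept
    ([0.1, 0.0], "goodware"),  # close, but goodware is never removed
]
cleaned = remove_similar_malware(teacher, fp_vectors=[[0.0, 0.2]], threshold=0.5)
print(cleaned)
```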

The similar malware removal unit 13 can use the Euclidean distance, the cosine similarity, the Jaccard coefficient, or the like as the similarity. For example, the Euclidean distance is defined as the square root of the inner product of the difference of the two feature vectors with itself. The similar malware removal unit 13 may therefore decide, for example, that two feature vectors are similar if their Euclidean distance is at or below a specified threshold.

The cosine similarity is defined as the cosine of the angle formed by the two feature vectors. The similar malware removal unit 13 may therefore decide, for example, that two feature vectors are similar if their cosine similarity is at or above a specified threshold. The Jaccard coefficient is defined, over the elements of the two feature vectors whose values are not 0, as the number of elements that coincide between the two vectors divided by the number of all elements. The similar malware removal unit 13 may therefore decide, for example, that two feature vectors are similar if their Jaccard coefficient is at or above a specified threshold.
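The three measures can be written out as follows. This is a sketch, not the patent's implementation: in particular, the Jaccard function assumes that "all elements" means the positions that are nonzero in at least one of the two vectors, which is the usual set-based reading.

```python
import math

def euclidean_distance(a, b):
    """Square root of the inner product of (a - b) with itself; smaller means more similar."""
    diff = [x - y for x, y in zip(a, b)]
    return math.sqrt(sum(d * d for d in diff))

def cosine_similarity(a, b):
    """Cosine of the angle between a and b; larger means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def jaccard_coefficient(a, b):
    """Shared nonzero positions divided by positions nonzero in either vector (assumed reading)."""
    nz_a = {i for i, x in enumerate(a) if x != 0}
    nz_b = {i for i, y in enumerate(b) if y != 0}
    union = nz_a | nz_b
    return len(nz_a & nz_b) / len(union) if union else 0.0

a = [1, 0, 2, 0]
b = [1, 1, 0, 0]
print(euclidean_distance(a, b), cosine_similarity(a, b), jaccard_coefficient(a, b))
```

Note the direction of each comparison: a distance is compared with "at or below" a threshold, while the two similarity scores are compared with "at or above".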

The erroneously detected goodware insertion unit 14 inserts the erroneously detected goodware data 110 into the teacher data 100 and outputs the result as the scrutinizing teacher data 120. For some classifiers based on machine learning algorithms, the order of the teacher data strongly affects classification accuracy. False detections can therefore be reduced by inserting the erroneously detected goodware data 110 into the teacher data 100 in an order suited to the type of the classifier 23 of the learning/determination device 20 described later.

For example, when the online machine learning algorithm AROW (Adaptive Regularization of Weight Vectors) was used as the classifier 23, inserting the erroneously detected goodware data 110 at the tail of the teacher data 100 was observed to reduce false detections further. When a support vector machine was used as the classifier 23, inserting the erroneously detected goodware data 110 at the head of the teacher data 100 was observed to reduce false detections further. When a perceptron or the like was used as the classifier 23, it was effective to insert multiple copies of the erroneously detected goodware data 110 at different positions in the teacher data 100, increasing the number of times it is learned.
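The classifier-dependent placement described above might be expressed as in the following sketch. The classifier names passed as strings, the `copies` parameter, and the evenly spaced positions used for the perceptron case are assumptions for illustration; the patent only says that multiple copies are inserted at different positions.

```python
GOODWARE, MALWARE = "goodware", "malware"

def insert_false_positives(teacher, fp_vectors, classifier_kind, copies=3):
    """Return a new teacher list with the falsely detected goodware inserted.

    teacher: list of (feature_vector, label) in learning order.
    fp_vectors: feature vectors of the erroneously detected goodware.
    """
    fps = [(vec, GOODWARE) for vec in fp_vectors]
    if classifier_kind == "arow":
        return teacher + fps              # tail of the teacher data
    if classifier_kind == "svm":
        return fps + teacher              # head of the teacher data
    if classifier_kind == "perceptron":
        result = list(teacher)
        n = len(teacher)
        # Evenly spaced insertion points (assumed policy), ending at the tail.
        positions = [n * (k + 1) // copies for k in range(copies)]
        for pos in sorted(positions, reverse=True):  # back to front keeps indices valid
            result[pos:pos] = fps
        return result
    raise ValueError(f"unknown classifier kind: {classifier_kind}")

teacher = [([1], MALWARE), ([2], GOODWARE), ([3], MALWARE), ([4], GOODWARE)]
spread = insert_false_positives(teacher, [[9]], "perceptron", copies=2)
print([vec[0] for vec, _ in spread])
```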

Finally, the scrutinizing teacher data output device 10 outputs the teacher data 100, as modified by the similar malware removal unit 13 and the erroneously detected goodware insertion unit 14, as the scrutinizing teacher data 120. Like the teacher data 100, the scrutinizing teacher data 120 consists of executable files of existing malware and goodware and information indicating whether each is classified as malware or goodware. Instead of whole executable files, however, it may hold fragment information containing only the parts that the learning/determination device 20 needs. For example, when the learning/determination device 20 performs malware determination using the PE header, fragment information containing only the PE header region may be used as the scrutinizing teacher data 120. The scrutinizing teacher data 120 is information on a medium such as a fixed storage medium, memory, or a communication channel.

The learning/determination device 20 includes a feature extraction unit 21, a dimension reduction unit 22, a classifier 23, and a determination unit 24. The feature extraction unit 21 extracts the features of the scrutinizing teacher data 120 and of the target file 130 as feature vectors. The target file 130 is an executable file whose classification as malware or goodware is unknown. The dimension reduction unit 22 performs dimension compression in the same way as the dimension reduction unit 12. The feature extraction unit 21 and the dimension reduction unit 22 may have the same functions as the feature extraction unit 11 and the dimension reduction unit 12.

The classifier 23 learns the feature vectors of the scrutinizing teacher data 120 and calculates, from the feature vector of the target file 130, a score indicating the malware-likeness of the target file 130. Specifically, the classifier 23 receives the feature vectors of the scrutinizing teacher data 120 together with the information on whether each executable file in the scrutinizing teacher data 120 is classified as malware or goodware, and performs machine learning. For example, a classifier such as logistic regression, a support vector machine, a perceptron, AROW, or naive Bayes can be used as the classifier 23.
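To make the learn-then-score role of the classifier concrete, here is a minimal sketch using a plain online perceptron, one of the classifier types listed above. The toy feature vectors, the +1/-1 label encoding, and the function names are assumptions for the example; the patent does not prescribe this implementation.

```python
def train_perceptron(samples, epochs=10, lr=1.0):
    """Online perceptron. samples: list of (feature_vector, label), +1 malware, -1 goodware.

    Samples are presented in teacher-data order, so the order can affect the result.
    """
    dim = len(samples[0][0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        for x, y in samples:
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:  # misclassified (or on the boundary): update
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

def malware_score(model, x):
    """Signed score: larger values mean more malware-like."""
    w, b = model
    return sum(wi * xi for wi, xi in zip(w, x)) + b

teacher = [
    ([2.0, 0.0], 1), ([0.0, 2.0], -1),
    ([3.0, 1.0], 1), ([1.0, 3.0], -1),
]
model = train_perceptron(teacher)
print(malware_score(model, [4.0, 0.0]), malware_score(model, [0.0, 4.0]))
```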

Having performed machine learning, the classifier 23 outputs the malware-likeness of the target file 130 as a numerical value called a score, computed from the feature vector of the target file 130. The determination unit 24 then determines from the score whether the file is malware and outputs the determination result. The determination result may be only the malware or goodware classification, or it may also include the score. The determination unit 24 may use threshold determination, for example judging the file to be malware if the score is at or above a certain threshold and goodware if it is below the threshold.

When threshold determination is used, raising the threshold lowers the detection rate but also lowers the false detection rate; conversely, lowering the threshold raises both the detection rate and the false detection rate. The relationship between the threshold and these two rates is not proportional, however. It follows the ROC curve, and it is known that within a certain range of thresholds, lowering the threshold raises the false detection rate only slightly while raising the detection rate considerably. By adjusting the threshold appropriately in advance, a relatively high detection rate can therefore be obtained while keeping the false detection rate within the allowable range. One adjustment method, for example, is to prepare and judge test target files (holdout validation) and, from the resulting scores, choose the threshold so that the false detection rate stays within the allowable range.
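The holdout-based adjustment can be sketched as follows, assuming a file is judged malware when its score is at or above the threshold. The holdout scores and the names `rates` and `choose_threshold` are invented for illustration.

```python
def rates(scored, threshold):
    """scored: list of (score, is_malware). Returns (detection_rate, false_detection_rate)."""
    malware = [s for s, is_mal in scored if is_mal]
    goodware = [s for s, is_mal in scored if not is_mal]
    detection = sum(s >= threshold for s in malware) / len(malware)
    false_detection = sum(s >= threshold for s in goodware) / len(goodware)
    return detection, false_detection

def choose_threshold(scored, max_false_detection):
    """Lowest threshold whose false detection rate is within the allowed limit.

    Both rates can only fall as the threshold rises, so the lowest acceptable
    threshold is also the one with the highest detection rate.
    """
    candidates = sorted({s for s, _ in scored}) + [float("inf")]
    for t in candidates:
        detection, false_detection = rates(scored, t)
        if false_detection <= max_false_detection:
            return t, detection, false_detection

holdout = [
    (0.9, True), (0.8, True), (0.7, False), (0.6, True),
    (0.4, False), (0.3, False), (0.2, True), (0.1, False),
]
print(choose_threshold(holdout, 0.25))
```

With these toy scores, allowing a 25% false detection rate gives a threshold of 0.6 and a detection rate of 75%, while allowing no false detections pushes the threshold up to 0.8 and the detection rate down to 50%.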

[Processing of First Embodiment]
The processing of the first embodiment is described with reference to FIGS. 2 to 4, which are flowcharts illustrating examples of the processes in the malware determination system according to the first embodiment. The case selection processing is described first with reference to FIG. 2. As noted above, case selection processing refers to the series of processes performed by the scrutinizing teacher data output device 10.

As shown in FIG. 2, the feature extraction unit 11 first extracts feature vectors from the teacher data 100 and the erroneously detected goodware data 110 (step S101). The teacher data 100 includes the existing malware data 101 and the existing goodware data 102. Next, the dimension reduction unit 12 compresses the dimensions of the feature vectors extracted by the feature extraction unit 11 (step S102).

Then, the similar malware removal unit 13 uses the feature vectors to extract the existing malware data 101 that is similar to the misdetected goodware data 110 and deletes it from the teacher data 100 (step S103). After that, the misdetected goodware insertion unit 14 inserts the misdetected goodware data 110 at an appropriate position in the teacher data 100 (step S104) and outputs the scrutinizing teacher data 120. Examples of an appropriate position include the head and the tail of the teacher data 100; it is a position at which the data is most easily learned within the scrutinizing teacher data 120, or at which the false detection rate of a classifier trained on the scrutinizing teacher data 120 becomes low.
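Steps S103 and S104 can be sketched roughly as below. The sketch rests on several assumptions that are not in the text: feature vectors are plain lists of numbers, similarity is measured by cosine similarity with a hypothetical cutoff of 0.9, and the misdetected goodware is simply appended at the tail. The function names are illustrative only.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_cases(malware, goodware, misdetected, similarity_limit=0.9):
    """Return scrutinized teacher data as a list of (vector, label) pairs."""
    # Step S103: drop malware that resembles any misdetected goodware.
    kept_malware = [
        m for m in malware
        if all(cosine_similarity(m, g) < similarity_limit for g in misdetected)
    ]
    teacher = [(m, "malware") for m in kept_malware]
    teacher += [(g, "goodware") for g in goodware]
    # Step S104: insert the misdetected goodware, labeled as goodware (tail here).
    teacher += [(g, "goodware") for g in misdetected]
    return teacher


# Hypothetical two-dimensional feature vectors.
existing_malware = [[1.0, 0.0], [0.0, 1.0]]
existing_goodware = [[0.5, 0.5]]
misdetected_goodware = [[0.9, 0.1]]

scrutinized = select_cases(existing_malware, existing_goodware, misdetected_goodware)
for vector, label in scrutinized:
    print(label, vector)
```

In this toy data the malware vector `[1.0, 0.0]` is nearly parallel to the misdetected goodware and is removed, while `[0.0, 1.0]` is kept.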

Next, the learning process performed by the learning/determination device 20 will be described with reference to FIG. 3. As shown in FIG. 3, the feature extraction unit 21 first extracts feature vectors from the scrutinizing teacher data 120 (step S111). Next, the dimension reduction unit 22 compresses the dimensions of the feature vectors extracted by the feature extraction unit 21 (step S112). The classifier 23 then learns the scrutinizing teacher data 120 (step S113).

Finally, the determination process performed by the learning/determination device 20 will be described with reference to FIG. 4. As shown in FIG. 4, the feature extraction unit 21 first extracts a feature vector from the target file 130 (step S121). Next, the dimension reduction unit 22 compresses the dimensions of the feature vector extracted by the feature extraction unit 21 (step S122). The classifier 23 then calculates, from the feature vector, a score indicating how malware-like the target file 130 is (step S123), and the determination unit 24 determines from the score whether the target file 130 is malware (step S124).
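The flow of steps S121 to S124 can be pictured as a chain of four pluggable stages. The sketch below is a structural illustration only: the byte-range histogram, the truncation used as "dimension reduction", and the fixed-weight scoring function are all hypothetical stand-ins, not the feature, compression method, or classifier of the patent.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Determiner:
    extract: Callable[[bytes], Sequence[float]]           # role of feature extraction unit 21
    reduce: Callable[[Sequence[float]], Sequence[float]]  # role of dimension reduction unit 22
    score: Callable[[Sequence[float]], float]             # role of trained classifier 23
    threshold: float                                      # used in the role of determination unit 24

    def determine(self, target_file: bytes):
        features = self.extract(target_file)   # step S121
        compressed = self.reduce(features)     # step S122
        s = self.score(compressed)             # step S123
        return s, s >= self.threshold          # step S124


def byte_range_histogram(data):
    """Hypothetical feature: share of bytes in each of four value ranges."""
    counts = [0, 0, 0, 0]
    for b in data:
        counts[b // 64] += 1
    total = len(data) or 1
    return [c / total for c in counts]


def keep_first_two(vector):
    """Stand-in for dimension compression: keep only the first two components."""
    return list(vector[:2])


def toy_score(vector):
    """Stand-in for a trained classifier: a fixed weighted sum."""
    return 0.2 * vector[0] + 0.9 * vector[1]


determiner = Determiner(byte_range_histogram, keep_first_two, toy_score, threshold=0.5)
print(determiner.determine(bytes([70] * 10)))
print(determiner.determine(bytes([10] * 10)))
```

Keeping the stages separate mirrors the unit structure in the text and lets the same extraction and reduction code be reused for both learning and determination.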

[Effects of the First Embodiment]
The scrutinizing teacher data output device 10 extracts the features of the existing malware data 101, the existing goodware data 102, and the misdetected goodware data 110 as feature vectors; deletes, from the teacher data 100 including the existing malware data 101 and the existing goodware data 102, the existing malware data 101 whose feature vectors are similar to the feature vector of the misdetected goodware data 110; inserts the misdetected goodware data 110 into the teacher data 100 as existing goodware data 102; and outputs the result as the scrutinizing teacher data 120.

Conventionally, two causes have been identified for false detections in which goodware is determined to be malware: the classifier learns malware whose features are similar to those of the goodware in question, and the goodware itself is under conditions that make it hard to learn well. In the first embodiment of the present invention, removing from the teacher data the existing malware data that is similar to the misdetected goodware data keeps the classifier from learning malware that resembles goodware. In addition, including the misdetected goodware data in the teacher data makes that data easier to learn. Furthermore, because no whitelist is used, no extra processing time or storage area is required. As a result, the first embodiment of the present invention enables malware determination using machine learning that is efficient and produces few false detections.

The scrutinizing teacher data output device 10 also determines, according to the type of classifier that learns the feature vectors of the scrutinizing teacher data 120, the position in the teacher data 100 at which to insert the misdetected goodware data 110, and inserts the data at that position. Depending on the type of classifier, the data may be learned most easily when inserted at the head, or when inserted at the tail. Determining the insertion position in light of the characteristics of the classifier allows learning that further lowers the false detection rate. Inserting multiple copies of the same misdetected goodware data into the teacher data also makes the data easier to learn.
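One way to picture the position and duplication choices is the sketch below. It assumes the position is picked empirically: each candidate ordering is passed to a caller-supplied function that trains a classifier and reports its false detection rate. That evaluation callback, the function names, and the toy evaluator are all hypothetical; the text itself only says the position depends on the classifier type.

```python
def insert_misdetected(teacher, misdetected, position, copies=1):
    """Return new teacher data with `copies` copies of the misdetected goodware inserted."""
    block = list(misdetected) * copies
    if position == "head":
        return block + list(teacher)
    if position == "tail":
        return list(teacher) + block
    raise ValueError("position must be 'head' or 'tail'")


def choose_position(teacher, misdetected, false_detection_rate_of,
                    positions=("head", "tail"), copies=1):
    """Pick the position whose resulting teacher data gives the lowest false detection rate.

    `false_detection_rate_of` is assumed to train a classifier on the given
    data and return its false detection rate on the misdetected goodware.
    """
    return min(
        positions,
        key=lambda p: false_detection_rate_of(
            insert_misdetected(teacher, misdetected, p, copies)
        ),
    )


# Toy evaluator standing in for "train and measure": it pretends that a
# classifier learns best from the samples it sees last.
def toy_evaluator(data):
    return 0.1 if data[-1] == "fp1" else 0.3


teacher_data = ["m1", "g1"]
best = choose_position(teacher_data, ["fp1"], toy_evaluator)
print(best)
print(insert_misdetected(teacher_data, ["fp1"], best, copies=2))
```

With a real classifier the evaluator would be far more expensive, so a fixed per-classifier choice, as the text suggests, may be preferable in practice.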

[Configuration of the Second Embodiment]
A device that outputs the misdetected goodware data 110 can be added to the malware determination system 1 so that the system covers the whole sequence from outputting the misdetected goodware data 110 to determining malware. In the second embodiment, therefore, the malware determination system 1 includes a misdetected goodware data output device 30, as shown in FIG. 5. FIG. 5 is a diagram showing an example of the configuration of the malware determination system according to the second embodiment.

As shown in FIG. 5, the misdetected goodware data output device 30 extracts and outputs the misdetected goodware data 110 from misdetected goodware candidate data 105 and from the teacher data 100, that is, the existing malware data 101 and the existing goodware data 102. The output misdetected goodware data 110 is then used by the scrutinizing teacher data output device 10 and the learning/determination device 20 in the same way as in the first embodiment. The misdetected goodware candidate data 105 consists of goodware that the malware determination system 1 may falsely detect. For example, the misdetected goodware candidate data 105 may include some or all of the existing goodware data 102 in the teacher data 100, and may also include goodware that is not in the teacher data 100, such as goodware that a user has reported as falsely detected.

The misdetected goodware data output device 30 will be described in detail with reference to FIG. 6. FIG. 6 is a diagram showing an example of the configuration of the misdetected goodware data output device according to the second embodiment. As shown in FIG. 6, the misdetected goodware data output device 30 has a learning/determination unit 30a and a misdetected goodware selection unit 40. The learning/determination unit 30a has a feature extraction unit 31, a dimension reduction unit 32, a classifier 33, and a determination unit 34.

The functions of the units of the learning/determination unit 30a are the same as those of the corresponding units of the learning/determination device 20 in the first embodiment. That is, the feature extraction unit 31 extracts the features of the teacher data 100 and the misdetected goodware candidate data 105 as feature vectors, and the dimension reduction unit 32 performs dimension compression in the same way as the dimension reduction unit 22.

The classifier 33 learns the feature vectors of the teacher data 100 and calculates, from the feature vectors of the misdetected goodware candidate data 105, a score indicating how malware-like each candidate is. Specifically, the classifier 33 performs machine learning using the feature vectors of the teacher data 100 together with the classification of each executable file in the teacher data 100 as malware or goodware. Once trained, the classifier 33 outputs the malware-likeness of the misdetected goodware candidate data 105 as a numerical score computed from its feature vectors. The determination unit 34 then outputs a scored determination result 106. The scored determination result 106 consists of a name identifying each goodware (for example, a file name, a file hash value, or a sequence number within the misdetected goodware candidate data) and its score.

Based on the scored determination result 106, the misdetected goodware selection unit 40 then selects misdetected goodware from the misdetected goodware candidate data 105 according to a specified condition and outputs it as the misdetected goodware data 110. Examples of the specified condition include selecting goodware whose score is at or above a specified threshold, and selecting a specified number of goodware items in descending order of score.
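The two example selection conditions can be sketched as follows, assuming the scored determination result is a list of (name, score) pairs. The file names, scores, and function names are hypothetical.

```python
def select_by_threshold(scored_results, threshold):
    """Names of goodware whose score is at or above the threshold."""
    return [name for name, score in scored_results if score >= threshold]


def select_top_n(scored_results, n):
    """Names of the n highest-scoring goodware, in descending order of score."""
    ranked = sorted(scored_results, key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:n]]


# Hypothetical scored determination result: (identifying name, score).
scored = [("a.exe", 0.2), ("b.exe", 0.95), ("c.exe", 0.7)]

print(select_by_threshold(scored, 0.7))
print(select_top_n(scored, 1))
```

The threshold form yields a variable number of items depending on the scores, whereas the top-n form fixes how many items are fed back into the teacher data.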

[Processing of the Second Embodiment]
The processing of the second embodiment will be described with reference to FIG. 7. FIG. 7 is a flowchart showing an example of processing in the malware determination system according to the second embodiment. As shown in FIG. 7, the feature extraction unit 31 first extracts feature vectors from the teacher data 100 (step S201). The teacher data 100 includes the existing malware data 101 and the existing goodware data 102. Next, the dimension reduction unit 32 compresses the dimensions of the feature vectors extracted by the feature extraction unit 31 (step S202). The classifier 33 then learns the teacher data 100 (step S203).

Next, the feature extraction unit 31 extracts feature vectors from the misdetected goodware candidate data 105 (step S204), and the dimension reduction unit 32 compresses the dimensions of the feature vectors extracted by the feature extraction unit 31 (step S205). The classifier 33 then calculates, from the feature vectors, a malware-likeness score for the misdetected goodware candidate data 105 (step S206), and the determination unit 34 outputs the scored determination result 106 (step S207). Finally, the misdetected goodware selection unit 40 extracts goodware from the misdetected goodware candidate data 105 in descending order of score and outputs it as the misdetected goodware data 110 (step S208).

[Effects of the Second Embodiment]
In the second embodiment, the misdetected goodware data output device 30 is added to the configuration of the malware determination system 1 of the first embodiment. The misdetected goodware data output device 30 extracts, as feature vectors, the features of the teacher data 100 and of the misdetected goodware candidate data 105, which is a candidate for the misdetected goodware data 110; learns the feature vectors of the teacher data 100; calculates, from the feature vectors of the misdetected goodware candidate data 105, a score indicating how malware-like each candidate is; and extracts the candidates whose score is greater than a predetermined threshold and outputs them as the misdetected goodware data. This allows the misdetected goodware data 110 to be selected and generated appropriately and efficiently.

[Other Embodiments]
The present invention is not limited to the above embodiments. For example, the feature extraction unit 11 and the dimension reduction unit 12 of the scrutinizing teacher data output device 10 may be the same as, or different from, the feature extraction unit 21 and the dimension reduction unit 22 of the learning/determination device 20. For instance, when the Jaccard coefficient is used as the similarity measure in the similar malware removal unit 13, the similarity cannot be calculated once dimension compression has been applied; in that case the two may perform different processing, with the dimension reduction unit 12 skipping dimension compression and only the dimension reduction unit 22 performing it.
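The Jaccard coefficient is defined on sets, as the size of the intersection divided by the size of the union, which is presumably why it has to be computed on the uncompressed features. A minimal sketch follows, assuming each file is represented by a set of feature names; the imported API names used here are hypothetical examples, and returning 0.0 for two empty sets is a convention chosen for this sketch.

```python
def jaccard_coefficient(features_a, features_b):
    """|A intersect B| / |A union B| for two collections of hashable features."""
    a, b = set(features_a), set(features_b)
    union = a | b
    if not union:
        return 0.0  # convention assumed here for two empty feature sets
    return len(a & b) / len(union)


# Hypothetical feature sets (for example, names of imported API functions).
misdetected_goodware = {"CreateFileA", "WriteFile", "RegSetValueA"}
existing_malware = {"CreateFileA", "WriteFile", "Sleep"}

print(jaccard_coefficient(misdetected_goodware, existing_malware))
```

Once such sets are projected to a lower-dimensional real-valued vector, set membership is no longer available, so a vector-based measure would have to be used instead.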

In the case selection described above, processing by the similar malware removal unit 13 is followed by processing by the misdetected goodware insertion unit 14, but the order may be reversed so that the misdetected goodware insertion unit 14 runs first and the similar malware removal unit 13 second. It is also not essential to perform both processes: only the processing of the similar malware removal unit 13, or only the processing of the misdetected goodware insertion unit 14, may be performed.

In the first embodiment, the malware determination system 1 is configured to include the scrutinizing teacher data output device 10 and the learning/determination device 20; alternatively, a single malware determination device may have the functions of both. FIG. 8 is a diagram showing an example of the configuration of a malware determination device according to another embodiment. As shown in FIG. 8, the malware determination device 1a consists of a scrutinizing teacher data output unit 10a, which has the same functions as the scrutinizing teacher data output device 10, and a learning/determination unit 20a, which has the same functions as the learning/determination device 20.

[System Configuration, etc.]
The components of each illustrated device are functional and conceptual, and need not be physically configured as illustrated. That is, the specific form of distribution and integration of the devices is not limited to what is illustrated; all or part of them can be functionally or physically distributed or integrated in arbitrary units according to various loads, usage conditions, and the like. Furthermore, all or any part of the processing functions performed by each device may be realized by a CPU (Central Processing Unit) and a program analyzed and executed by that CPU, or may be realized as hardware using wired logic.

Of the processes described in the embodiments, all or part of those described as being performed automatically can also be performed manually, and all or part of those described as being performed manually can also be performed automatically by known methods. In addition, the processing procedures, control procedures, specific names, and information including various data and parameters shown in the above description and the drawings can be changed as desired unless otherwise specified.

[Program]
FIG. 9 is a diagram showing an example of a computer that realizes the malware determination device 1a or the scrutinizing teacher data output device 10 by executing a program. The computer 1000 has, for example, a memory 1010 and a CPU 1020. The computer 1000 also has a hard disk drive interface 1030, a disk drive interface 1040, a serial port interface 1050, a video adapter 1060, and a network interface 1070. These units are connected by a bus 1080.

The memory 1010 includes a ROM (Read Only Memory) 1011 and a RAM (Random Access Memory) 1012. The ROM 1011 stores, for example, a boot program such as a BIOS (Basic Input Output System). The hard disk drive interface 1030 is connected to a hard disk drive 1090. The disk drive interface 1040 is connected to a disk drive 1100, into which a removable storage medium such as a magnetic disk or an optical disk is inserted. The serial port interface 1050 is connected to, for example, a mouse 1110 and a keyboard 1120. The video adapter 1060 is connected to, for example, a display 1130.

The hard disk drive 1090 stores, for example, an OS 1091, an application program 1092, a program module 1093, and program data 1094. That is, the program that defines each process of the malware determination device 1a or the scrutinizing teacher data output device 10 is implemented as the program module 1093, in which code executable by a computer is written. The program module 1093 is stored, for example, in the hard disk drive 1090. For example, a program module 1093 for executing the same processing as the functional configuration of the malware determination device 1a or the scrutinizing teacher data output device 10 is stored in the hard disk drive 1090. The hard disk drive 1090 may be replaced by an SSD (Solid State Drive).

The setting data used in the processing of the above embodiments is stored as the program data 1094 in, for example, the memory 1010 or the hard disk drive 1090. The CPU 1020 reads the program module 1093 and the program data 1094 stored in the memory 1010 or the hard disk drive 1090 into the RAM 1012 as needed and executes them.

The program module 1093 and the program data 1094 need not be stored in the hard disk drive 1090; they may, for example, be stored in a removable storage medium and read by the CPU 1020 via the disk drive 1100 or the like. Alternatively, the program module 1093 and the program data 1094 may be stored in another computer connected via a network (a LAN (Local Area Network), a WAN (Wide Area Network), or the like) and read from that computer by the CPU 1020 via the network interface 1070.

Description of Symbols
1 Malware determination system
1a Malware determination device
10 Scrutinizing teacher data output device
10a Scrutinizing teacher data output unit
11, 21, 31 Feature extraction unit
12, 22, 32 Dimension reduction unit
13 Similar malware removal unit
14 Misdetected goodware insertion unit
20 Learning/determination device
20a, 30a Learning/determination unit
23, 33 Classifier
24, 34 Determination unit
30 Misdetected goodware data output device
40 Misdetected goodware selection unit
100 Teacher data
101 Existing malware data
102 Existing goodware data
105 Misdetected goodware candidate data
106 Scored determination result
110 Misdetected goodware data
120 Scrutinizing teacher data
130 Target file
140 Determination result

Claims

A scrutinizing teacher data output device for malware determination, comprising:
a feature extraction unit that extracts, as feature vectors, the features of existing malware data known to be malicious, existing goodware data known to be benign, and misdetected goodware data known to be benign but determined to be malicious;
a similar malware removal unit that deletes, from teacher data including the existing malware data and the existing goodware data, existing malware data whose feature vector is similar to the feature vector of the misdetected goodware data; and
a misdetected goodware insertion unit that inserts the misdetected goodware data into the teacher data as existing goodware data.

The scrutinizing teacher data output device for malware determination according to claim 1, wherein the misdetected goodware insertion unit determines, according to the type of classifier that learns the feature vectors of the teacher data, the position in the teacher data at which to insert the misdetected goodware data, and inserts the misdetected goodware data at the determined position.

The scrutinizing teacher data output device for malware determination according to claim 1, wherein the misdetected goodware insertion unit inserts the misdetected goodware data at the head or the tail of the teacher data.

The scrutinizing teacher data output device for malware determination according to any one of claims 1 to 3, wherein the misdetected goodware insertion unit inserts a plurality of copies of the same misdetected goodware data into the teacher data.

A malware determination system, comprising:
a scrutinizing teacher data output device having a first feature extraction unit that extracts, as feature vectors, the features of existing malware data known to be malicious, existing goodware data known to be benign, and misdetected goodware data known to be benign but determined to be malicious, a similar malware removal unit that deletes, from teacher data including the existing malware data and the existing goodware data, existing malware data whose feature vector is similar to the feature vector of the misdetected goodware data, and a misdetected goodware insertion unit that inserts the misdetected goodware data into the teacher data as existing goodware data; and
a learning/determination device having a second feature extraction unit that extracts, as feature vectors, the features of a target file to be determined as malware or not and of the teacher data into which the misdetected goodware data has been inserted by the misdetected goodware insertion unit, a first classifier that learns the feature vectors of the teacher data and calculates, from the feature vector of the target file, a score indicating how malware-like the target file is, and a determination unit that determines, based on the score, whether the target file is malware.

The malware determination system according to claim 5, further comprising a misdetected goodware data output device having: a third feature extraction unit that extracts, as feature vectors, the features of the teacher data and of misdetected goodware candidate data, which is a candidate for the misdetected goodware data; a second classifier that learns the feature vectors of the teacher data and calculates, from the feature vectors of the misdetected goodware candidate data, a score indicating how malware-like the misdetected goodware candidate data is; and a misdetected goodware extraction unit that extracts, from the misdetected goodware candidate data, the data whose score is greater than a predetermined threshold and outputs it as the misdetected goodware data.

A malware determination method executed by a malware determination system, the method comprising:
a first feature extraction step of extracting, as feature vectors, the features of existing malware data known to be malicious, existing goodware data known to be benign, and misdetected goodware data known to be benign but determined to be malicious;
a similar malware removal step of deleting, from teacher data including the existing malware data and the existing goodware data, existing malware data whose feature vector is similar to the feature vector of the misdetected goodware data;
a misdetected goodware insertion step of inserting the misdetected goodware data into the teacher data as existing goodware data;
a second feature extraction step of extracting, as feature vectors, the features of a target file to be determined as malware or not and of the teacher data into which the misdetected goodware data has been inserted in the misdetected goodware insertion step;
a classification step of learning the feature vectors of the teacher data and calculating, from the feature vector of the target file, a score indicating how malware-like the target file is; and
a determination step of determining, based on the score, whether the target file is malware.

A scrutinizing teacher data output program for malware determination for causing a computer to function as the scrutinizing teacher data output device for malware determination according to any one of claims 1 to 4.