JP6039768B1 - ADJUSTMENT DEVICE, ADJUSTMENT METHOD, AND ADJUSTMENT PROGRAM

Info

Publication number: JP6039768B1
Application number: JP2015159508A
Authority: JP
Inventors: 靖岡野; 充敏熊谷; 谷川　真樹; 真樹谷川
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2015-08-12
Filing date: 2015-08-12
Publication date: 2016-12-07
Anticipated expiration: 2035-08-12
Also published as: JP2017037555A

Abstract

[Problem] To efficiently adjust the adjustment items of each accuracy improvement method when multiple accuracy improvement methods for malware determination are combined. [Solution] A malware determination device 20 performs preprocessing in the order of whitelist application, then feature extraction and dimension reduction, and determines by a threshold whether a file is malware. For this device, an adjustment device 10 applies settings chosen from setting candidates to each process in the order of score calculation, feature extraction and dimension reduction, whitelist application, and determination, and analyzes the malware determination accuracy. For any process whose optimal setting has already been determined, the determined optimal setting is applied in the subsequent analyses. [Selected drawing] FIG. 1

Description

The present invention relates to an adjustment device, an adjustment method, and an adjustment program for adjusting the settings of a malware determination device.

Antivirus systems are known that determine whether executable files used on operating systems such as MS Windows (registered trademark), Apple OSX (registered trademark), Linux (registered trademark), and other Unix (registered trademark) systems are malware. Antivirus systems use two methods: dynamic determination, which executes the executable file to make the determination, and static determination, which makes the determination without executing it. Static determination is used when determination speed is particularly important.

Typical static determination methods are hash value match determination and pattern match determination (signature scanning). Hash value match determination holds hash values (MD5, SHA1, SHA256, and so on) of known malware in a database and determines that an inspection target file is malware if its hash value matches an entry in the database. Pattern match determination holds specific character strings and byte codes contained in known malware in a database and determines that an inspection target file is malware if it contains any of the registered character strings or byte codes.
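The two static checks described above can be sketched as follows. This is a minimal illustration, not an actual antivirus implementation: the database contents, the sample bytes, and the function names `hash_match` and `pattern_match` are all hypothetical.

```python
import hashlib

# Hypothetical databases; the entries are placeholders for illustration only.
KNOWN_MALWARE_SHA256 = {hashlib.sha256(b"known-malware-sample").hexdigest()}
KNOWN_PATTERNS = [b"evil_payload", b"\xde\xad\xbe\xef"]


def hash_match(data: bytes) -> bool:
    """Return True if the SHA-256 digest of the file is in the database."""
    return hashlib.sha256(data).hexdigest() in KNOWN_MALWARE_SHA256


def pattern_match(data: bytes) -> bool:
    """Return True if the file contains any registered string or byte code."""
    return any(pattern in data for pattern in KNOWN_PATTERNS)


original = b"known-malware-sample"
variant = b"known-malware-sample!"  # the same bytes with one byte appended

print(hash_match(original))  # the digest is in the database
print(hash_match(variant))   # a one-byte change yields a different digest
print(pattern_match(b"header evil_payload trailer"))
```

Because any change to the file changes its digest, the hash check only recognizes byte-identical copies of known samples.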

However, even a slight modification to malware changes its hash value, and may also change the specific character strings used for pattern match determination. These methods therefore have difficulty detecting variants made by modifying existing malware, as well as new kinds of malware. Heuristic determination has been proposed as a way to determine such variants and new malware: malware-likeness is defined based on past experience, and the determination is made according to that definition.

Several methods that use machine learning for heuristic determination have been proposed. For example, one proposed method learns in advance the readable character strings contained in executable files and determines malware-likeness based on how many words frequently used in malware are contained in the inspected file (see, for example, Patent Document 1).

In machine learning, the data to be learned (teacher data) is first converted into a set of parameters and then learned by a machine learning algorithm. Each parameter is called a feature, and the set of parameters is called a feature vector. In the example of Patent Document 1, each word together with its number of occurrences is a feature, and the set of these is the feature vector. The number of features in the set is called the feature vector dimension; in the example above, the number of distinct words is the feature vector dimension.
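As a small illustration of these terms, the sketch below turns a list of words into a feature vector of occurrence counts. The vocabulary and the input words are made up for this example, and `feature_vector` is a hypothetical helper, not a function from the cited document.

```python
from collections import Counter


def feature_vector(words, vocabulary):
    """Count how often each vocabulary word occurs; one count per feature."""
    counts = Counter(words)
    return [counts[word] for word in vocabulary]


# Hypothetical vocabulary: each entry corresponds to one feature.
vocabulary = ["connect", "encrypt", "print"]

vector = feature_vector(["connect", "encrypt", "connect"], vocabulary)
dimension = len(vector)  # the feature vector dimension

print(vector, dimension)
```

Here the feature vector dimension equals the number of distinct words in the vocabulary, matching the definition in the text.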

Heuristic determination using machine learning can determine variants and new malware, but its false detection rate (the rate at which non-malware files (goodware) are wrongly determined to be malware) tends to be relatively high. Various accuracy improvement methods have therefore been proposed for machine learning. For example, a malware determination method based on machine learning over the PE header information of executable files has been proposed, which improves detection accuracy by using appropriate dimension compression and machine learning algorithms (see, for example, Non-Patent Document 1).

Accuracy can also be improved by case selection, a technique that scrutinizes the teacher data, for example by removing from it data that appears to degrade classification accuracy (see, for example, Non-Patent Document 2). Non-Patent Document 2 uses a case selection technique for image classification with a support vector machine (SVM): the SVM internal parameters αi are used to extract ambiguous image data that is hard to classify, and the extracted data is removed from the teacher data.

Improving accuracy by tuning the parameters of the machine learning algorithm in use is also common practice (see, for example, Non-Patent Document 3). In addition, false detection correction using a whitelist is widely used as a general technique for correcting false detections at determination time.

Patent Document 1: JP 2012-27710 A

Non-Patent Document 1: Shafiq, et al., “PE-Miner: Mining Structural Information to Detect Malicious Executables in Realtime”, RAID '09, 2009.
Non-Patent Document 2: Takatori, et al., “Proposal of a case selection method based on internal parameters of support vector machines and its application to the video boundary detection problem”, 22nd Annual Conference of the Japanese Society for Artificial Intelligence, 2008.
Non-Patent Document 3: Hsu, et al., “A Practical Guide to Support Vector Classification”, http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf

In malware determination using machine learning, combining various accuracy improvement methods can improve determination accuracy far more than using any single method alone. Each accuracy improvement method, however, has its own variety of adjustment items, and the conventional techniques cannot adjust these items efficiently.

That is, to obtain the optimal settings of the adjustment items of each method when various accuracy improvement methods are combined, the combinations of methods and adjustment item settings must be tried exhaustively across all methods, so adjustment cannot be done efficiently.

As an example of exhaustive adjustment, consider tuning the parameters of a machine learning algorithm. One parameter tuning technique is the grid search method described in Non-Patent Document 2. In grid search, several candidate values are listed for an algorithm parameter, each candidate is applied in turn, learning and determination are actually carried out as a trial, and the candidate that gave the best accuracy is adopted. When there are multiple parameters, candidate values are listed for each, and all combinations of the candidate values are tried. For example, with two parameters and 10 candidate values each, the number of trials is 10 × 10 = 100.
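The grid search procedure and its trial count can be sketched as below. The parameter names `c` and `gamma` and the function `toy_accuracy` are hypothetical; `toy_accuracy` merely stands in for an actual round of learning and determination.

```python
from itertools import product


def grid_search(candidates, evaluate):
    """Try every combination of candidate values.

    Returns (best setting, best score, number of trials).
    """
    names = list(candidates)
    best, best_score, trials = None, float("-inf"), 0
    for values in product(*(candidates[name] for name in names)):
        setting = dict(zip(names, values))
        score = evaluate(setting)  # one trial: learn and determine
        trials += 1
        if score > best_score:
            best, best_score = setting, score
    return best, best_score, trials


def toy_accuracy(setting):
    """Hypothetical stand-in for measured accuracy; peaks at c=3, gamma=7."""
    return -((setting["c"] - 3) ** 2 + (setting["gamma"] - 7) ** 2)


# Two parameters with 10 candidate values each.
candidates = {"c": range(10), "gamma": range(10)}
best, best_score, trials = grid_search(candidates, toy_accuracy)
print(best, trials)
```

With two parameters of 10 candidates each, the loop body runs 10 × 10 = 100 times, as stated in the text.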

Another possibility is to tune feature selection, a technique that reduces the dimension by removing some features and thereby improves malware determination accuracy. One way to tune feature selection is the wrapper method (see, for example, Japanese Patent Application No. 2014-120428). The wrapper method actually carries out learning and determination for each candidate feature selection setting and adopts the setting that gave the best accuracy.

One way to create candidate feature selection settings is the variable increase method. With p features (or attributes), the variable increase method requires at least 2p − 1 and at most p × (p − 1)/2 trials. To find the most accurate combination of algorithm parameters and feature selection, all combinations of both sets of candidates must be tried exhaustively.

Suppose there are two algorithm parameters to tune with 10 candidate values each, and 10 features to select from. Then 10 × 10 × 19 to 10 × 10 × 45, that is, 1900 to 4500 trials are required. With exhaustive trials, the number of trials thus grows exponentially each time a method or parameter to be tuned is added, and efficient adjustment becomes impossible.
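The arithmetic in this example can be checked with a short calculation. The function name `exhaustive_trials` is hypothetical, and the minimum and maximum feature selection counts are taken directly from the formulas stated in the text (2p − 1 and p × (p − 1)/2).

```python
def exhaustive_trials(param_candidate_counts, feature_count):
    """Range of trials when grid search and feature selection are combined.

    Uses the trial-count formulas as stated in the text.
    """
    grid = 1
    for count in param_candidate_counts:
        grid *= count  # all combinations of parameter candidates
    p = feature_count
    selection_min = 2 * p - 1          # minimum stated in the text
    selection_max = p * (p - 1) // 2   # maximum stated in the text
    return grid * selection_min, grid * selection_max


# Two parameters with 10 candidates each, 10 features to select from.
low, high = exhaustive_trials([10, 10], 10)
print(low, high)
```

Each additional parameter multiplies both bounds by its number of candidates, which is the growth the text describes.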

As a further accuracy improvement method, a case selection technique is conceivable that achieves a particularly low false detection rate by using known falsely detected goodware to remove data of similar malware from the teacher data, and by inserting the data of that falsely detected goodware into the teacher data at appropriate positions so that it is learned more readily (see, for example, Japanese Patent Application No. 2015-087924). As with the other methods, however, the adjustment items of the case selection technique were also tuned by exhaustive trial, so adjustment could not always be done efficiently.

The adjustment device of the present invention determines the optimal settings for executing each process in a malware determination device that executes one or more preprocessing steps in a predetermined order, then executes a score calculation process in which a classifier calculates a score for a file, and then performs a determination process that determines, based on the calculated score, whether the file is malware. The adjustment device comprises: an instruction unit that selects the processes of the malware determination device in a predetermined order; a setting application unit that, each time the instruction unit selects a process, sequentially applies the setting candidates for executing the selected process and applies, to each process other than the selected process whose optimal setting has already been determined, that optimal setting; a verification unit that, each time the setting application unit applies settings, acquires the result corresponding to each setting candidate obtained when the processes of the malware determination device are executed according to the applied settings; and an analysis unit that, each time the verification unit has acquired the results for all setting candidates of the selected process, determines as the optimal setting of the selected process the setting candidate judged, from those results, to give the highest malware determination accuracy.

The adjustment method of the present invention determines the optimal settings for executing each process in a malware determination device that executes one or more preprocessing steps in a predetermined order, then executes a score calculation process in which a classifier calculates a score for a file, and then performs a determination process that determines, based on the calculated score, whether the file is malware. The adjustment method includes: an instruction step of selecting the processes of the malware determination device in a predetermined order; a setting application step of, each time a process is selected in the instruction step, sequentially applying the setting candidates for executing the selected process and applying, to each process other than the selected process whose optimal setting has already been determined, that optimal setting; a verification step of, each time settings are applied in the setting application step, acquiring the result corresponding to each setting candidate obtained when the processes of the malware determination device are executed according to the applied settings; and an analysis step of, each time the results for all setting candidates of the selected process have been acquired in the verification step, determining as the optimal setting of the selected process the setting candidate judged, from those results, to give the highest malware determination accuracy.
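The four steps of the adjustment method can be sketched as a loop that tunes one process at a time and carries forward the settings already determined. This is only an illustration under stated assumptions: the names `adjust`, `toy_accuracy`, and the process and candidate values are hypothetical, and the use of `defaults` for processes that have not yet been tuned is an assumption, since the passage above does not say what is applied to them.

```python
def adjust(process_order, candidates, defaults, evaluate):
    """Tune one process at a time, reusing settings already determined."""
    decided = {}
    trials = 0
    for process in process_order:                      # instruction step
        results = {}
        for candidate in candidates[process]:          # setting application step
            # Decided settings override defaults; the candidate overrides both.
            settings = {**defaults, **decided, process: candidate}
            results[candidate] = evaluate(settings)    # verification step
            trials += 1
        decided[process] = max(results, key=results.get)  # analysis step
    return decided, trials


def toy_accuracy(settings):
    """Hypothetical stand-in for the accuracy measured by verification."""
    return (-abs(settings["classifier"] - 2)
            - abs(settings["features"] - 5)
            - abs(settings["threshold"] - 1))


order = ["classifier", "features", "threshold"]
candidates = {"classifier": [0, 1, 2, 3], "features": [4, 5, 6], "threshold": [0, 1]}
defaults = {"classifier": 0, "features": 4, "threshold": 0}

best, trials = adjust(order, candidates, defaults, toy_accuracy)
print(best, trials)
```

In this sketch the number of trials is the sum of the candidate counts (4 + 3 + 2 = 9) rather than their product (4 × 3 × 2 = 24), which is where the efficiency gain over exhaustive trial comes from. The toy accuracy function is separable, so the sequential result happens to coincide with the exhaustive optimum; that need not hold in general.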

According to the present invention, when accuracy improvement methods for malware determination are combined, the adjustment items of each accuracy improvement method can be adjusted efficiently.

FIG. 1 is a diagram illustrating an example of the configuration of a malware determination system including the adjustment device and the malware determination device according to the first embodiment.
FIG. 2 is a diagram illustrating an example of data in the setting storage unit of the adjustment device according to the first embodiment.
FIG. 3 is a diagram for explaining the adjustment steps of the adjustment device according to the first embodiment.
FIG. 4 is a diagram for explaining the relationship between the score threshold and the detection rate and false detection rate.
FIG. 5 is a diagram for explaining the ROC curve.
FIG. 6 is a flowchart illustrating the processing of the adjustment device according to the first embodiment.
FIG. 7 is a flowchart illustrating the processing of the malware determination device according to the first embodiment.
FIG. 8 is a diagram illustrating an example of the configuration of a malware determination system including the adjustment device and the malware determination device according to the second embodiment.
FIG. 9 is a diagram illustrating an example of a computer that implements the adjustment device by executing a program.

Embodiments of the adjustment device, adjustment method, and adjustment program according to the present application are described in detail below with reference to the drawings. The adjustment device, adjustment method, and adjustment program according to the present application are not limited by these embodiments.

[First Embodiment]
The following describes the configuration, processing, and effects of a malware determination system including the adjustment device and the malware determination device according to the first embodiment. The adjustment device 10 determines the optimal settings for executing each process in the malware determination device 20, which executes one or more preprocessing steps in a predetermined order, then executes a score calculation process in which a classifier calculates a score for a file, and then determines, based on the calculated score, whether the file is malware.

[Configuration of First Embodiment]
The configuration of the malware determination system including the adjustment device and the malware determination device according to the first embodiment is described with reference to FIG. 1. FIG. 1 is a diagram illustrating an example of the configuration of this malware determination system. First, the data flow in the malware determination system 1 is described. As shown in FIG. 1, the adjustment device 10 receives instructions from a user or the like, original teacher data for machine learning, and adjustment data and original verification data for adjustment and verification.

Based on these data, the adjustment device 10 creates adjusted teacher data and adjusted settings and outputs them to the malware determination device 20. The malware determination device 20 then uses the data acquired from the adjustment device 10 to output the determination result for a separately input determination target file. Part of the configuration and processing of the adjustment device 10 is determined by the configuration and other properties of the malware determination device 20, so the configuration of the malware determination device 20 is described first.

[Malware Determination Device]
As shown in FIG. 1, the malware determination device 20 includes a whitelist application unit 201, a feature extraction / dimension reduction unit 202, a classifier 203, a determination unit 204, and a setting storage unit 210. The malware determination device 20 first reads the settings and other data stored in the setting storage unit 210 and initializes itself, and then learns teacher data composed of malware and non-malware executable files (goodware). After learning the teacher data, it determines whether a determination target file is malware or goodware. The whitelist application unit 201, feature extraction / dimension reduction unit 202, classifier 203, and determination unit 204 are used as appropriate for learning the teacher data and for determining the determination target file.

As described above, the malware determination device 20 performs the determination process after performing the learning process. First, the function of each unit in the learning process is described. The malware determination device 20 executes the learning process in the order of feature extraction and dimension reduction by the feature extraction / dimension reduction unit 202, followed by score calculation by the classifier 203. The teacher data to be learned consists of executable files of existing malware and goodware, together with information indicating whether each executable file is classified as malware or goodware.

The feature extraction / dimension reduction unit 202 extracts features from the teacher data. As needed, it further reduces the dimension of the extracted features by feature selection or dimension compression, and generates feature vectors. One example of a technique the feature extraction / dimension reduction unit 202 can use to weight feature vectors is tf-idf (Term Frequency - Inverse Document Frequency).
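As an illustration of tf-idf weighting, the sketch below uses one common unsmoothed variant (term frequency divided by document length, multiplied by the logarithm of the inverse document frequency). The passage above does not specify which variant is used, so this formula, the function name `tf_idf`, and the token lists are assumptions for illustration.

```python
import math


def tf_idf(documents):
    """documents: list of non-empty token lists.

    Returns one {term: weight} dict per document.
    """
    n = len(documents)
    # Document frequency: in how many documents each term appears.
    df = {}
    for doc in documents:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    weighted = []
    for doc in documents:
        weights = {}
        for term in set(doc):
            tf = doc.count(term) / len(doc)   # term frequency
            idf = math.log(n / df[term])      # inverse document frequency
            weights[term] = tf * idf
        weighted.append(weights)
    return weighted


# Hypothetical tokens extracted from two files.
docs = [["open", "write", "close"], ["open", "encrypt", "encrypt"]]
weights = tf_idf(docs)
print(weights[1])
```

A term that occurs in every document (here "open") receives weight 0, while a term concentrated in one document (here "encrypt") receives a positive weight.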

Feature selection is a technique that keeps or discards individual features so that the target accuracy improves. A representative dimension compression technique is principal component analysis (PCA), which automatically merges correlated features into a single feature. Dimension reduction may use only one of feature selection and dimension compression, or both.

The classifier 203 performs machine learning using the feature vectors and the information indicating whether the executable file corresponding to each feature vector is classified as malware or goodware. The classifier 203 can use algorithms such as logistic regression, SVM, perceptron, Passive-Aggressive, Adaptive Regularization of Weight Vectors (AROW), and naive Bayes.
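To make the learning step concrete, the sketch below trains a plain perceptron, one of the algorithms listed above, on labeled feature vectors. It is a minimal textbook version, not the classifier 203 itself; the training data, the label encoding (+1 for malware, -1 for goodware), and the `epochs` parameter are assumptions for illustration.

```python
def train_perceptron(vectors, labels, epochs=10):
    """labels: +1 for malware, -1 for goodware. Returns (weights, bias)."""
    w = [0.0] * len(vectors[0])
    b = 0.0
    for _ in range(epochs):  # repeated passes over the same teacher data
        for x, y in zip(vectors, labels):
            s = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * s <= 0:  # misclassified (or on the boundary): update
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
    return w, b


def score(w, b, x):
    """Malware-likeness of feature vector x under the learned model."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


# Hypothetical, linearly separable teacher data.
vectors = [[2, 0], [3, 1], [0, 2], [1, 3]]
labels = [1, 1, -1, -1]

w, b = train_perceptron(vectors, labels)
print(w, b, [score(w, b, x) for x in vectors])
```

After training, malware examples receive positive scores and goodware examples negative scores on this toy data.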

Next, the function of each unit in the determination process is described. The malware determination device 20 executes the determination process in the following order: whitelist application by the whitelist application unit 201, feature extraction and dimension reduction by the feature extraction / dimension reduction unit 202, score calculation by the classifier 203, and determination of whether the file is malware by the determination unit 204. A determination target file is an executable file for which it is unknown whether it is malware or goodware.

The whitelist application unit 201 determines whether a determination target file matches the whitelist, and determines that a matching file is goodware. The whitelist may be a list of hashes computed with a hash algorithm such as MD5, SHA1, or SHA256, or a set of values (for example, feature vectors) needed to calculate some defined similarity.

The feature extraction / dimension reduction unit 202 generates a feature vector for each determination target file that does not match the whitelist. The feature vector is generated in the same way as in the learning process. The classifier 203 then outputs, from the feature vector, the malware-likeness of the determination target file as a numerical value called a score.

The determination unit 204 determines, based on the score, whether the determination target file is malware, and outputs the determination result. The determination result may be only the malware/goodware classification, or the classification with the score attached. The determination unit 204 may, for example, use threshold determination, in which a file is determined to be malware if its score exceeds a certain threshold and goodware if it does not.
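The order of the determination process (whitelist, feature extraction, score calculation, threshold determination) can be sketched as a single function. The whitelist here is a set of SHA-256 hashes, which is one of the options the text mentions; the functions `extract` and `score`, the sample bytes, and the threshold value are hypothetical stand-ins.

```python
import hashlib


def determine(data, whitelist, extract, score, threshold):
    """Whitelist, then feature extraction, then scoring, then threshold check.

    Returns (classification, score); the score is None for whitelisted files.
    """
    if hashlib.sha256(data).hexdigest() in whitelist:
        return "goodware", None          # whitelisted files are never scored
    s = score(extract(data))
    return ("malware" if s > threshold else "goodware"), s


def extract(data):
    """Hypothetical feature extraction: occurrence counts of two byte strings."""
    return [data.count(b"encrypt"), data.count(b"print")]


def score(vector):
    """Hypothetical linear score standing in for a trained classifier."""
    return 1.0 * vector[0] - 0.5 * vector[1]


whitelist = {hashlib.sha256(b"trusted encrypt tool").hexdigest()}

r1 = determine(b"trusted encrypt tool", whitelist, extract, score, 0.5)
r2 = determine(b"encrypt encrypt print", whitelist, extract, score, 0.5)
r3 = determine(b"print print", whitelist, extract, score, 0.5)
print(r1, r2, r3)
```

The first file would score above the threshold on its own, but the whitelist check runs first and classifies it as goodware, which is how a whitelist corrects false detections.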

[Adjustment Items]
The learning process and the determination process are executed after the settings and other data stored in the setting storage unit 210 have been read and used for initialization. These settings affect the processing of each unit of the malware determination device 20 described so far that performs the learning and determination processes.

The adjustment device 10 creates adjusted settings, in which the adjustment items to be stored in the setting storage unit 210 have been adjusted, and outputs them to the malware determination device 20. The adjustment device 10 may also adjust the teacher data and output adjusted teacher data to the malware determination device 20. In that case, the teacher data used in the learning process is the adjusted teacher data.

Specific examples of adjustment items follow. Adjustment items for the feature extraction / dimension reduction unit 202 include the weighting setting used during feature extraction (for example, whether to apply tf-idf weights), the setting of which features to keep or discard (feature selection setting), and the algorithm used for dimension compression together with its parameters (for example, the number of compressed dimensions).

Adjustment items for the classifier 203 include the algorithm to use and the tuning parameters of that algorithm. With some algorithms, accuracy may improve when the same teacher data is learned repeatedly. When such an algorithm is used, the number of learning iterations may also be an adjustment item.

The whitelist used by the whitelist application unit 201 may itself be an adjustment item. An adjustment item for the determination unit 204 is, for example, the score threshold when threshold determination on the score is used. The adjustment device 10 also adjusts the teacher data, for example by selecting which teacher data to keep through case selection and by changing the order of the data, and outputs the result as adjusted teacher data.

[Adjustment Device]
As shown in FIG. 1, the adjustment device 10 includes an instruction unit 101, a teacher / verification data creation unit 102, a verification unit 103, an analysis unit 104, and a setting storage unit 110. The adjustment device 10 executes adjustment steps in sequence for adjustment items such as the settings of the malware determination device 20 and the teacher data supplied to it. It determines the adjustment items and setting values so that the index obtained in each adjustment step improves, and outputs the adjusted teacher data and adjusted settings to the malware determination device 20.

指示部１０１は、調整工程を管理し、その工程に応じた測定・検証の実行の指示を行う。また、指示部１０１は、マルウェア判定装置２０の各処理を所定の順序で選択することで、マルウェア判定装置２０の各部の処理に対応した調整工程の実施順序を決定する。指示部１０１は、例えば分類器２０３の調整を１番目に設定し、マルウェア判定装置２０において分類器２０３より以前に行われる処理については、処理の順序と逆の順序としてもよい。 The instruction unit 101 manages the adjustment process and gives an instruction to execute measurement / verification in accordance with the process. In addition, the instruction unit 101 determines the execution order of the adjustment process corresponding to the processing of each unit of the malware determination device 20 by selecting each processing of the malware determination device 20 in a predetermined order. For example, the instruction unit 101 may set the adjustment of the classifier 203 first, and the processing performed before the classifier 203 in the malware determination device 20 may be in the reverse order of the processing order.

教師・検証用データ作成部１０２は、指示部１０１の指示に基づき、原教師データ、原検証用データ、調整用データを用いて検証部１０３へ与える教師データおよび検証用データを作成する。 The teacher / verification data creation unit 102 creates teacher data and verification data to be given to the verification unit 103 using the original teacher data, the original verification data, and the adjustment data based on an instruction from the instruction unit 101.

検証部１０３は、マルウェア判定装置２０と同様の処理を実行し、検証を行う。これにより、検証部１０３は、例えば判定結果としてスコアを出力する。そして、分析部１０４は、調整工程に対応する分析工程を実施する。具体的に、分析部１０４は、検証部１０３の学習または判定結果を基に指標を算出し、指標に基づいて最適な調整事項および設定値を決定し、決定した調整事項および設定値を設定格納部１１０へ格納する。 The verification unit 103 performs the same processing as the malware determination device 20 and performs verification. As a result, the verification unit 103 outputs, for example, a score as the determination result. The analysis unit 104 then performs the analysis step corresponding to the adjustment step. Specifically, the analysis unit 104 calculates an index from the learning or determination result of the verification unit 103, determines the optimal adjustment items and setting values based on the index, and stores the determined adjustment items and setting values in the setting storage unit 110.

また、原教師データおよび原検証用データはマルウェア判定装置２０で用いられる教師データと同様の構成の情報体である。また、調整用データは誤検知しやすいグッドウェアのデータであり、例えば事例選択手法で用いられるものである（例えば特願２０１５−０８７９２４参照）。 The original teacher data and the original verification data are information bodies having the same configuration as the teacher data used in the malware determination device 20. Further, the adjustment data is goodware data that is easily detected erroneously, and is used, for example, in a case selection method (see, for example, Japanese Patent Application No. 2015-087924).

ここで、図２を用いて、調整事項と設定値の例について説明する。図２は、第１の実施形態に係る調整装置の設定格納部のデータの一例を示す図である。図２の番号１は、マルウェア判定装置２０の分類器２０３に関する調整事項である。また、図２の番号２〜６は、マルウェア判定装置２０の特徴抽出・次元削減部２０２に関する調整事項である。また、図２の番号７は、マルウェア判定装置２０の教師データに関する調整事項である。また、図２の番号８は、マルウェア判定装置２０の判定部２０４に関する調整事項である。また、図２の番号９は、マルウェア判定装置２０のホワイトリスト適用部２０１に関する調整事項である。 Here, examples of adjustment items and setting values will be described with reference to FIG. 2. FIG. 2 is a diagram illustrating an example of data in the setting storage unit of the adjustment device according to the first embodiment. Number 1 in FIG. 2 is an adjustment item related to the classifier 203 of the malware determination device 20. Numbers 2 to 6 in FIG. 2 are adjustment items related to the feature extraction / dimension reduction unit 202 of the malware determination device 20. Number 7 in FIG. 2 is an adjustment item related to the teacher data of the malware determination device 20. Number 8 in FIG. 2 is an adjustment item related to the determination unit 204 of the malware determination device 20. Number 9 in FIG. 2 is an adjustment item related to the white list application unit 201 of the malware determination device 20.

ここで、調整装置１０で行われる、各調整工程および各調整工程に対応する分析工程を含む各工程について説明する。なお、各調整工程を実施する前に、指示部１０１は、各工程における指示部１０１が生成する設定の候補の範囲、検証方法、および分析部１０４で算出する指標、目的指標値等をあらかじめ指示として与えられているものとする。また、調整工程および分析工程については図３を用いて説明を行う。図３は、第１の実施形態に係る調整装置の各調整工程を説明するための図である。 Here, each step performed by the adjustment device 10, including each adjustment step and the analysis step corresponding to it, will be described. It is assumed that, before the adjustment steps are carried out, the instruction unit 101 has been given in advance, as instructions, the range of setting candidates it generates in each step, the verification method, the index to be calculated by the analysis unit 104, the target index value, and so on. The adjustment and analysis steps will be described with reference to FIG. 3. FIG. 3 is a diagram for explaining each adjustment step of the adjustment device according to the first embodiment.

各調整工程において、検証部１０３は、選択された処理を実行するための設定の候補を順次適用し、また、指示部１０１によって選択された処理以外の処理のうち最適な設定が決定済みである処理に、最適な設定を適用する。そして、検証部１０３は、設定の適用が行われるたびに、適用された設定にしたがってマルウェア判定装置２０の各処理を実行した場合の、設定の候補のそれぞれに対応する結果を取得する。 In each adjustment step, the verification unit 103 sequentially applies the setting candidates for executing the selected process and, for any process other than the one selected by the instruction unit 101 whose optimal setting has already been determined, applies that optimal setting. Each time a setting is applied, the verification unit 103 acquires the result, for each setting candidate, of executing each process of the malware determination device 20 according to the applied settings.

まず、調整工程の前段階の処理として、調整装置１０は、設定格納部１１０の初期化、およびデータのクレンジングを行う。ここで、データクレンジングについて説明する。機械学習によるマルウェア静的判定においては、ファイル種別、すなわち実行ファイルが３２ｂｉｔ実行アプリケーションであるか、６４ｂｉｔ実行アプリケーションであるか、またはVisual Basic製のアプリケーションであるか等に応じて、その特徴の傾向が異なる。そして、特定のファイル種別のみを学習・判定する方が精度向上する。 First, as preparation for the adjustment steps, the adjustment device 10 initializes the setting storage unit 110 and cleanses the data. Data cleansing is described here. In static malware determination by machine learning, the tendency of the features differs depending on the file type, that is, whether the executable file is a 32-bit application, a 64-bit application, an application built with Visual Basic, and so on. Accuracy improves when only a specific file type is learned and determined.

そこで、データクレンジングを行うことにより、原データ（原教師データ、原検証用データ）から指定のファイル種別のみをデータとして選別する。また、マルウェアと間違ってラベルがつけられたグッドウェアが原データ中に存在する場合がある。そこで、データクレンジングにおいては、グッドウェアと非常に類似したマルウェアを原データから除去する。以降の工程では、特記が無い限り、原データはクレンジング済みのものを指すものとする。 Therefore, by performing data cleansing, only the designated file type is selected as data from the original data (original teacher data, original verification data). Also, there may be goodware in the original data that is incorrectly labeled as malware. Therefore, in data cleansing, malware very similar to goodware is removed from the original data. In the subsequent steps, unless otherwise specified, the original data indicates the cleansed data.

次に、図３に示すように、調整装置１０は１番目の調整工程として分類器の調整を行う。指示部１０１は分類器のアルゴリズム、そのアルゴリズムのパラメータ、反復学習回数等の設定の組の候補をいくつか生成し、そのうちの１組を設定格納部１１０へ仮設定する。そして、教師・検証用データ作成部１０２は、原データを用いて、教師データおよび検証用データを生成する。 Next, as shown in FIG. 3, the adjustment device 10 adjusts the classifier as the first adjustment step. The instructing unit 101 generates several setting group candidates such as the classifier algorithm, the algorithm parameters, the number of iteration learning, and the like, and temporarily sets one of them in the setting storage unit 110. Then, the teacher / verification data creation unit 102 generates teacher data and verification data using the original data.

検証部１０３は、設定格納部１１０に仮設定された設定等に基づき、その教師データおよび検証用データで学習・判定し、検証する。この検証がすべての設定の組の候補について実施されると、分析部１０４は、各検証結果について指標を算出し、最も良い指標であった設定の組を選出し、決定した調整事項としてその設定を設定格納部１１０へ格納する。 The verification unit 103 performs learning and determination with the teacher data and the verification data according to the settings provisionally stored in the setting storage unit 110, and thereby performs verification. Once this verification has been carried out for every candidate set of settings, the analysis unit 104 calculates an index for each verification result, selects the set of settings with the best index, and stores it in the setting storage unit 110 as the determined adjustment items.
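The way each adjustment step fixes earlier decisions and tries only the candidates of the current step can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `tune_in_order`, the dictionary-based settings, and the `evaluate` callback (standing in for verification plus index calculation, where a larger index is better) are all assumptions made for the example.

```python
from typing import Any, Callable, Dict, List, Tuple

Settings = Dict[str, Any]


def tune_in_order(
    stages: List[Tuple[str, List[Any]]],
    evaluate: Callable[[Settings], float],
    defaults: Settings,
) -> Settings:
    """Decide one adjustment item at a time, keeping earlier decisions fixed."""
    # `decided` plays the role of the setting storage unit 110 (assumption).
    decided: Settings = dict(defaults)
    for name, candidates in stages:  # order chosen by the instruction unit 101
        best_value, best_index = decided.get(name), float("-inf")
        for candidate in candidates:
            trial = dict(decided)
            trial[name] = candidate      # provisional setting for this run
            index = evaluate(trial)      # verification followed by analysis
            if index > best_index:       # ties keep the earlier candidate
                best_value, best_index = candidate, index
        decided[name] = best_value       # store the determined adjustment item
    return decided
```

With two stages of three candidates each, `evaluate` is called six times, once per candidate of each stage.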

検証方法の一例として、そのまま原教師データを教師データ、原検証用データを検証用データとして学習・判定を行うホールドアウト検証がある。その他の検証方法の例として、Ｋ−分割交差検証が挙げられる。Ｋ−分割交差検証では、まず原教師データと原検証用データを混合し、Ｋ個のデータを生成する。次に、１個のデータを検証用データとして選び、残りのデータを教師データとして学習・判定する。まだ検証用データとしていないデータがあれば、それを検証用データとして選んで学習・判定を行い、全てのデータが１度は検証用データとして選ばれるまで学習・判定を繰り返す。 One example of a verification method is holdout verification, in which the original teacher data is used as-is as the teacher data and the original verification data as the verification data for learning and determination. Another example is K-fold cross-validation. In K-fold cross-validation, the original teacher data and the original verification data are first mixed and divided into K subsets. Next, one subset is chosen as the verification data, and learning and determination are performed with the remaining subsets as the teacher data. If any subset has not yet served as the verification data, it is chosen next, and learning and determination are repeated until every subset has been chosen as the verification data once.
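The K-fold procedure described above can be sketched in a few lines. The sketch assumes the original teacher data and original verification data have already been mixed into one sequence; the function name `k_fold_splits` and the strided (unshuffled) partitioning are illustrative choices, not something the description specifies.

```python
from typing import Iterator, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def k_fold_splits(samples: Sequence[T], k: int) -> Iterator[Tuple[List[T], List[T]]]:
    """Yield (teacher_data, verification_data) pairs, one per fold."""
    if not 2 <= k <= len(samples):
        raise ValueError("k must be between 2 and the number of samples")
    # Partition the mixed data into K subsets by taking every k-th sample.
    folds = [list(samples[i::k]) for i in range(k)]
    for i in range(k):
        verification = folds[i]
        # All remaining subsets together form the teacher data for this round.
        teacher = [s for j, fold in enumerate(folds) if j != i for s in fold]
        yield teacher, verification
```

Each sample appears in the verification data of exactly one round, which matches the stopping condition in the text.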

指標の例として、機械学習の検証では良く用いられるＡＵＣやＦ値が挙げられる。また、指定の誤検知率以下となるように調整した際の検知率や、指定の検知率以上となるように調整した際の誤検知率を１から引いたもの（真陰性率）を指標として用いてもよい。 Examples of the index include AUC and the F-measure, which are commonly used in the verification of machine learning. Alternatively, the detection rate obtained when adjusted so that the false detection rate is at or below a specified value, or one minus the false detection rate (the true negative rate) obtained when adjusted so that the detection rate is at or above a specified value, may be used as the index.

マルウェア判定において、スコア閾値による判定を用いる場合、その閾値を増大させると検知率は下がるが、誤検知率も下がる。逆に、閾値を減少させると、検知率が上がり、誤検知率も上がる。この閾値と検知率および誤検知率の関係は比例関係ではなく、図４に示すように、ある閾値の範囲では閾値を増やしても誤検知率はそれほど上がらず、検知率は非常に上がることが知られている。図４は、スコア閾値と、検知率および誤検知率との関係を説明するための図である。また、このとき誤検知率と検知率の関係は、図５に示すようにＲＯＣ曲線に従う。図５は、ＲＯＣ曲線について説明するための図である。 In malware determination using a score threshold, raising the threshold lowers the detection rate but also lowers the false detection rate. Conversely, lowering the threshold raises both the detection rate and the false detection rate. The relationship between the threshold and these two rates is not proportional: as shown in FIG. 4, it is known that within a certain range of thresholds the false detection rate changes only slightly with the threshold while the detection rate changes greatly. FIG. 4 is a diagram for explaining the relationship between the score threshold and the detection and false detection rates. The relationship between the false detection rate and the detection rate then follows an ROC curve, as shown in FIG. 5. FIG. 5 is a diagram for explaining the ROC curve.

そのため、あらかじめ適切に閾値を調整することにより、指定の許容範囲内に誤検知率を収めつつ、比較的高い検知率を得られるようにすることができる。同様に、スコア閾値を調整することによって、指定の検知率以上となるようにすることもできる。なお、これらのスコア閾値は分析部１０４にて判定結果のスコアを分析することで決定する。 Therefore, by appropriately adjusting the threshold value in advance, it is possible to obtain a relatively high detection rate while keeping the false detection rate within a specified allowable range. Similarly, by adjusting the score threshold, it is possible to achieve a specified detection rate or higher. These score thresholds are determined by analyzing the score of the determination result in the analysis unit 104.
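As a rough illustration of how a score threshold might be chosen so that the false detection rate stays within a specified limit, the sketch below scans candidate thresholds from high to low. The function name, the convention that a file is judged malware when its score is at or above the threshold, and the use of observed scores as candidate thresholds are assumptions for this example; both score lists are assumed to be non-empty.

```python
from typing import List, Tuple


def threshold_for_max_fpr(
    malware_scores: List[float],
    goodware_scores: List[float],
    max_fpr: float,
) -> Tuple[float, float]:
    """Return (threshold, detection_rate) for the lowest threshold whose
    false detection rate does not exceed max_fpr."""
    best = (float("inf"), 0.0)  # threshold that flags nothing
    candidates = sorted(set(malware_scores) | set(goodware_scores), reverse=True)
    for threshold in candidates:
        fpr = sum(s >= threshold for s in goodware_scores) / len(goodware_scores)
        if fpr > max_fpr:
            break  # lowering the threshold further can only raise the rate
        detection = sum(s >= threshold for s in malware_scores) / len(malware_scores)
        best = (threshold, detection)
    return best
```

The returned detection rate is the kind of index described above (detection rate at a specified false detection rate), so the same routine can serve both threshold selection and index calculation in a sketch like this.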

次に、図３に示すように、調整装置１０は２番目の調整工程として次元削減の調整を行う。このとき、検証に用いるデータおよび分類器の設定は、１番目の調整工程で決定され、設定格納部１１０に格納されているものを用いる。次元削減の一つの手法である特徴選択における設定項目として、例えばＦｉｌｅｎａｍｅ、Ｆｉｌｅｓｉｚｅ等がある（例えば特願２０１４−１２０４２８を参照）。もう一方の手法として、次元圧縮が挙げられ、次元圧縮の代表例として、主成分分析がある。 Next, as illustrated in FIG. 3, the adjustment device 10 adjusts dimension reduction as the second adjustment step. The data used for verification and the classifier settings are those determined in the first adjustment step and stored in the setting storage unit 110. Setting items for feature selection, which is one method of dimension reduction, include, for example, Filename and Filesize (see, for example, Japanese Patent Application No. 2014-120428). The other method is dimension compression, a representative example of which is principal component analysis.

主成分分析での設定項目は、例えば圧縮次元数である。指示部１０１は特徴抽出の重み、特徴選択設定および圧縮次元数の候補をいくつか作成し、そのうちの１つを設定格納部１１０に仮設定する。そして、教師・検証用データ作成部１０２は、原データを用いて、教師データおよび検証用データを生成する。 The setting item in the principal component analysis is, for example, the number of compression dimensions. The instructing unit 101 creates several candidates for feature extraction weights, feature selection settings, and number of compression dimensions, and temporarily sets one of them in the setting storage unit 110. Then, the teacher / verification data creation unit 102 generates teacher data and verification data using the original data.

特徴選択設定の候補の生成については、総当たり、変数増加法、変数減少法、ステップワイズ法等の手法を用いることができる（例えば特願２０１４−１２０４２８参照）。また、検証手法および指標は、他の工程と同一のものを用いても良いし、場合によっては別のものを用いてもよい。例えば、前の工程では分割交差検証を用い、本工程では処理時間を短縮するため、ホールドアウト検証を用いるようにしてもよい。 Candidates for the feature selection setting can be generated by methods such as brute force, the variable increase method (forward selection), the variable decrease method (backward elimination), and the stepwise method (see, for example, Japanese Patent Application No. 2014-120428). The verification method and index may be the same as in the other steps or, depending on the case, different. For example, K-fold cross-validation may be used in the previous step and holdout verification in this step in order to shorten the processing time.
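Of the candidate-generation methods listed, the variable increase method (forward selection) is easy to sketch: start with no features and repeatedly add the one that improves the index most, stopping when no addition helps. The function name and the `evaluate` callback (which would wrap a verification run and return an index, larger being better) are assumptions for this illustration.

```python
from typing import Callable, List, Sequence, Tuple


def forward_selection(
    features: Sequence[str],
    evaluate: Callable[[List[str]], float],
) -> Tuple[List[str], float]:
    """Greedily add the feature that most improves the index."""
    selected: List[str] = []
    best_index = evaluate(selected)
    remaining = list(features)
    while remaining:
        # Try each remaining feature on top of the current selection.
        scored = [(evaluate(selected + [f]), f) for f in remaining]
        index, feature = max(scored, key=lambda pair: pair[0])
        if index <= best_index:
            break  # no candidate improves the index any further
        selected.append(feature)
        remaining.remove(feature)
        best_index = index
    return selected, best_index
```

Compared with brute force over all subsets, this evaluates far fewer candidates, at the cost of possibly missing the best combination.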

次に、図３に示すように、調整装置１０は３番目の調整工程として教師データの調整を行う。このとき、検証に用いるデータ、分類器および特徴抽出・次元削減の設定は、１番目および２番目の調整工程で決定され、設定格納部１１０に格納されているものを用いる。この場合、教師・検証用データ作成部１０２は、検証部１０３が他の工程で行っているように、選択された処理を実行するためのデータの候補を設定の候補として順次適用し、また、指示部１０１によって選択された処理以外の処理のうち最適な設定が決定済みである処理に、最適な設定を適用する。 Next, as shown in FIG. 3, the adjustment device 10 adjusts teacher data as a third adjustment step. At this time, the data used for verification, the classifier, and the feature extraction / dimension reduction settings are determined in the first and second adjustment steps and stored in the setting storage unit 110. In this case, the teacher / verification data creation unit 102 sequentially applies the data candidates for executing the selected process as the setting candidates, as the verification unit 103 performs in other steps, and The optimum setting is applied to the process for which the optimum setting has been determined among the processes other than the process selected by the instruction unit 101.

分類器によっては教師データの並び順がその判定精度に大きく影響する場合がある。データ順が精度に影響する分類器として、例えば、パーセプトロン、Passive-Aggressive、ＡＲＯＷ、ニューラルネットワーク等がある。また、教師データとして、機械学習させると逆に判定精度が悪化するデータがある。一例として、余りにも古いデータは、直近のデータと傾向が異なり過ぎて、悪影響を及ぼすことがある。 With some classifiers, the order of the teacher data can greatly affect determination accuracy. Classifiers whose accuracy is affected by data order include, for example, the perceptron, Passive-Aggressive, AROW, and neural networks. In addition, some teacher data actually worsens determination accuracy when it is used for machine learning. For example, data that is too old may differ too much in tendency from recent data and have an adverse effect.

指示部１０１および教師・検証用データ作成部１０２は、原教師データから含有するデータの期間および並び順等が異なった教師データの候補をいくつか作成する。検証部１０３は、作成されたそれぞれの教師データ候補と検証用データを、これまでの工程で決定した設定等に基づき、学習・判定してホールドアウト検証を行う。そして、分析部１０４は各候補の指標を算出する。 The instruction unit 101 and the teacher / verification data creation unit 102 create several teacher data candidates having different periods and arrangement order of the data contained from the original teacher data. The verification unit 103 performs holdout verification by learning and determining each of the created teacher data candidates and verification data based on the settings determined in the previous steps. Then, the analysis unit 104 calculates an index for each candidate.
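One way to picture the generation of teacher data candidates that differ in period and order is shown below. The record layout `(collected_on, label, path)`, the cutoff dates, and the three orderings are hypothetical choices for this sketch; the description only says that candidates differ in the period of data they contain and in their arrangement order.

```python
import random
from datetime import date
from typing import Dict, List, Sequence, Tuple

Sample = Tuple[date, str, str]  # (collected_on, label, path): assumed layout


def teacher_data_candidates(
    samples: Sequence[Sample],
    cutoffs: Sequence[date],
    seed: int = 0,
) -> Dict[Tuple[date, str], List[Sample]]:
    """Build candidates that keep data on or after each cutoff, in several orders."""
    candidates: Dict[Tuple[date, str], List[Sample]] = {}
    for cutoff in cutoffs:
        recent = [s for s in samples if s[0] >= cutoff]  # drop data older than cutoff
        candidates[(cutoff, "oldest_first")] = sorted(recent, key=lambda s: s[0])
        candidates[(cutoff, "newest_first")] = sorted(
            recent, key=lambda s: s[0], reverse=True
        )
        shuffled = list(recent)
        random.Random(seed).shuffle(shuffled)  # fixed seed keeps runs repeatable
        candidates[(cutoff, "shuffled")] = shuffled
    return candidates
```

Each candidate would then be verified with the settings decided in the earlier steps, and the one with the best index kept as the adjusted teacher data.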

この検証がすべての設定の組の候補について実施されると、分析部１０４は、各検証結果について指標を算出し、最も良い指標であった教師データ候補を選出し、調整済教師データとして決定する。 When this verification is performed for all the set candidates, the analysis unit 104 calculates an index for each verification result, selects a teacher data candidate that is the best index, and determines it as adjusted teacher data. .

また、誤検知をより低減させる手法として、誤検知グッドウェアのデータを調整用データとして用いた事例選択手法が考えられる。事例選択手法によって、調整済教師データをさらに調整し、より判定精度を向上させることができる（例えば特願２０１５−０８７９２４参照）。 Further, as a technique for further reducing false detection, a case selection technique using erroneous detection goodware data as adjustment data can be considered. By using the case selection method, the adjusted teacher data can be further adjusted to further improve the determination accuracy (see, for example, Japanese Patent Application No. 2015-087924).

また、検証手法および指標は、他の工程と同一のものを用いても良いし、場合によっては別のものを用いてもよい。例えば、事例選択手法を用いると、より誤検知率を低減させることができる（例えば特願２０１５−０８７９２４参照）。そのため、例えば本工程では前の工程より指定誤検知率が低い指標を用いるようにしてもよい。 Further, the same verification method and index as those used in the other steps may be used, or different methods may be used depending on circumstances. For example, when the case selection method is used, the false detection rate can be further reduced (for example, see Japanese Patent Application No. 2015-087924). Therefore, for example, in this step, an index having a lower designated false detection rate than the previous step may be used.

次に、図３に示すように、調整装置１０は４番目の調整工程として判定設定の調整およびホワイトリストの生成を行う。このとき、検証に用いるデータ、分類器および特徴抽出・次元削減の設定は、１番目から３番目の調整工程で決定され、設定格納部１１０に格納されているものを用いる。 Next, as illustrated in FIG. 3, the adjustment device 10 adjusts the determination setting and generates a white list as the fourth adjustment step. At this time, the data used for verification, the classifier, and the feature extraction / dimension reduction setting are determined in the first to third adjustment steps and stored in the setting storage unit 110.

この調整工程は、マルウェア判定装置２０がスコア閾値による判定を行う場合に行われるものである。まず、これまでの調整工程で調整された設定、調整済教師データ、および検証用データを用いて、検証部１０３はホールドアウト検証を行う。次に、分析部１０４は指標を算出し、その指標が指示等で与えられた目的指標値に達しているか否かを判定する。目的指標値として、例えば指定誤検知率以下に調整した検知率を用いてもよい。 This adjustment process is performed when the malware determination apparatus 20 performs determination based on the score threshold. First, the verification unit 103 performs holdout verification using the settings adjusted in the adjustment process so far, adjusted teacher data, and verification data. Next, the analysis unit 104 calculates an index and determines whether or not the index has reached a target index value given by an instruction or the like. As the target index value, for example, a detection rate adjusted to be equal to or lower than a specified erroneous detection rate may be used.

目的指標値が達成された場合は、上記検証において分析部１０４で算出されたスコア閾値をそのまま設定格納部１１０に格納する。一方、目的指標値が達成されなかった場合は、分析部１０４は以下の処理を行い、ホワイトリストとスコア閾値を決定する。まず、分析部１０４は、検知率の最低値を越えるスコア閾値を算出する。次に、指定誤検知率以下となる個数だけ高いスコアから順にグッドウェアを抽出し、ホワイトリストとする。分析部１０４は、このようにして得たスコア閾値とホワイトリストを設定格納部１１０に格納する。 When the target index value is achieved, the score threshold calculated by the analysis unit 104 in the above verification is stored as-is in the setting storage unit 110. When the target index value is not achieved, the analysis unit 104 determines the white list and the score threshold as follows. First, the analysis unit 104 calculates a score threshold at which the detection rate exceeds its required minimum. Next, it extracts goodware in descending order of score, taking as many items as are needed to bring the false detection rate to or below the specified rate, and uses them as the white list. The analysis unit 104 stores the score threshold and white list obtained in this way in the setting storage unit 110.
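The fallback procedure for the case where the target index value is not achieved can be sketched as follows. This is only one reading of the text: the function name, the rule that a file is judged malware when its score is at or above the threshold, meeting (rather than strictly exceeding) the minimum detection rate, and the rounding of rates to counts are all assumptions for this example.

```python
import math
from typing import Dict, List, Tuple


def threshold_and_whitelist(
    malware_scores: List[float],
    goodware: Dict[str, float],
    min_detection_rate: float,
    max_fpr: float,
) -> Tuple[float, List[str]]:
    """Pick a threshold that secures the detection rate, then whitelist the
    highest-scoring misjudged goodware until the false detection rate fits."""
    ranked = sorted(malware_scores, reverse=True)
    needed = max(1, math.ceil(min_detection_rate * len(ranked)))
    threshold = ranked[needed - 1]  # lowest score that must still be detected
    # Goodware that would be misjudged as malware, highest score first.
    misjudged = sorted(
        (name for name, score in goodware.items() if score >= threshold),
        key=lambda name: goodware[name],
        reverse=True,
    )
    allowed = math.floor(max_fpr * len(goodware))  # tolerated false detections
    excess = max(0, len(misjudged) - allowed)
    return threshold, misjudged[:excess]
```

In the toy case of four goodware files and a tolerated rate of 0.25, one false detection may remain, so only the single highest-scoring misjudged file goes on the white list.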

そして、調整工程が完了すると、調整装置１０は、決定した調整事項、すなわち設定格納部１１０に格納された各種設定および調整済教師データを出力する。マルウェア判定装置２０は、これらの出力を読み込み、調整事項を各部に反映させる。なお、調整装置１０は電子ファイルとして出力を行ってもよいし、通信上のデータとして出力を行ってもよい。 When the adjustment process is completed, the adjustment device 10 outputs the determined adjustment items, that is, various settings and adjusted teacher data stored in the setting storage unit 110. The malware determination device 20 reads these outputs and reflects the adjustment items in each unit. In addition, the adjustment apparatus 10 may output as an electronic file, and may output as data on communication.

また、各工程において、分析部１０４が各候補から設定等を選出し決定する処理には、人間による判断が含まれていてもよい。例えば、指標によっては、複数の候補で同一値の最良の指標値が計測されることがある。その場合は、それら候補を人間が判断してどれか１つに決定することが考えられる。 In each step, the process in which the analysis unit 104 selects and determines settings from the candidates may include human judgment. For example, depending on the index, several candidates may yield the same best index value. In that case, a person may review those candidates and decide on one of them.

［第１の実施形態の処理］
第１の実施形態の処理について説明する。まず、図６を用いて調整装置１０の処理について説明する。図６は、第１の実施形態に係る調整装置の処理を示すフローチャートである。 [Process of First Embodiment]
The processing of the first embodiment will be described. First, the processing of the adjustment device 10 will be described with reference to FIG. 6. FIG. 6 is a flowchart illustrating the processing of the adjustment device according to the first embodiment.

図６に示すように、指示部１０１は、指示等を読み込み、設定格納部１１０の初期設定を行う（ステップＳ１０１）。次に、教師・検証用データ作成部１０２は、原教師データ、調整用データ、原検証用データから誤りや対象外のデータ等を除去するデータクレンジングを行う（ステップＳ１０２）。そして、指示部１０１は、実行する調整工程を指定する（ステップＳ１０３）。 As illustrated in FIG. 6, the instruction unit 101 reads instructions and the like and performs initial setting of the setting storage unit 110 (step S101). Next, the teacher / verification data creation unit 102 performs data cleansing, which removes errors, non-target data, and the like from the original teacher data, the adjustment data, and the original verification data (step S102). Then, the instruction unit 101 designates the adjustment step to be executed (step S103).

そして、指示部１０１は各部に調整工程における検証の実行を指示する（ステップＳ１０４）。教師・検証用データ作成部１０２は、原教師データ、調整用データおよび原検証用データから教師データおよび検証用データを作成し、設定格納部１１０へ格納する（ステップＳ１０５）。そして、検証部１０３は、設定格納部１１０の設定等に基づき、教師データを学習し、検証用データを判定する（ステップＳ１０６）。 Then, the instruction unit 101 instructs each unit to execute verification in the adjustment process (step S104). The teacher / verification data creation unit 102 creates teacher data and verification data from the original teacher data, adjustment data, and original verification data, and stores them in the setting storage unit 110 (step S105). Then, the verification unit 103 learns teacher data based on the setting of the setting storage unit 110 and determines verification data (step S106).

ここで、指示部１０１は、現工程の全ての検証が完了したか否かを判定する（ステップＳ１０７）。なお、全ての検証が完了したか否かは、例えば設定の候補の全てについて検証データの判定が行われたか否かによって判定される。また、指示部１０１は、全ての検証が完了していないと判定した場合（ステップＳ１０７、Ｎｏ）、ステップＳ１０４へ戻り、さらに検証を実行させる。 Here, the instruction unit 101 determines whether or not all verification of the current process is completed (step S107). Note that whether or not all the verifications have been completed is determined, for example, by whether or not verification data has been determined for all of the setting candidates. If the instruction unit 101 determines that all the verifications have not been completed (No in step S107), the instruction unit 101 returns to step S104 and further performs verification.

指示部１０１が全ての検証が完了したと判定した場合（ステップＳ１０７、Ｙｅｓ）、分析部１０４は、各調整工程に対応した分析方法によって分析を行い、決定した調整事項を設定格納部１１０へ格納する（ステップＳ１０８）。 When the instruction unit 101 determines that all the verifications have been completed (step S107, Yes), the analysis unit 104 performs analysis using an analysis method corresponding to each adjustment process, and stores the determined adjustment items in the setting storage unit 110. (Step S108).

ここで、指示部１０１は、全ての調整工程が完了していない場合（ステップＳ１０９、Ｎｏ）、ステップＳ１０３へ戻り、さらに調整工程を指定する。また、全調整工程が完了した場合（ステップＳ１０９、Ｙｅｓ）、調整装置１０は調整された設定等を出力し（ステップＳ１１０）、処理を終了する。なお、各調整工程の検証方法および分析方法は図３に示す通りである。 If not all adjustment steps have been completed (step S109, No), the instruction unit 101 returns to step S103 and designates the next adjustment step. When all adjustment steps have been completed (step S109, Yes), the adjustment device 10 outputs the adjusted settings and the like (step S110) and ends the processing. The verification method and analysis method of each adjustment step are as shown in FIG. 3.

次に、図７を用いてマルウェア判定装置２０の処理について説明する。図７は、第１の実施形態に係るマルウェア判定装置の処理を示すフローチャートである。まず、学習処理について説明する。学習処理においては、マルウェアであるかグッドウェアであるかが既知の教師データが入力される。図７に示すように、学習を行う場合、特徴抽出・次元削減部２０２は教師データから特徴ベクトルを生成する（ステップＳ２０１）。そして、分類器２０３は特徴ベクトルを用いて学習を行う（ステップＳ２０２）。 Next, processing of the malware determination device 20 will be described with reference to FIG. FIG. 7 is a flowchart showing processing of the malware determination device according to the first embodiment. First, the learning process will be described. In the learning process, teacher data that is known as malware or goodware is input. As shown in FIG. 7, when learning is performed, the feature extraction / dimension reduction unit 202 generates a feature vector from teacher data (step S201). Then, the classifier 203 performs learning using the feature vector (step S202).

次に、判定処理について説明する。判定処理においては、マルウェアであるかグッドウェアであるかが未知の判定対象データが入力される。図７に示すように、ホワイトリスト適用部２０１は、対象ファイルがホワイトリストに該当する場合（ステップＳ２１１、Ｙｅｓ）、対象ファイルがグッドウェアであると判定し、処理を終了する。 Next, the determination process will be described. In the determination process, determination target data that is not yet known to be malware or goodware is input. As illustrated in FIG. 7, when the target file matches the white list (step S211, Yes), the white list application unit 201 determines that the target file is goodware and ends the processing.

対象ファイルがホワイトリストに該当しない場合（ステップＳ２１１、Ｎｏ）、特徴抽出・次元削減部２０２は、対象ファイルの特徴抽出および必要に応じて次元削減を行い、特徴ベクトルを生成する（ステップＳ２１２）。そして、分類器２０３は、特徴ベクトルからマルウェアらしさのスコアを算出する（ステップＳ２１３）。そして、判定部２０４はスコアからマルウェア判定を行う（ステップＳ２１４）。 When the target file does not match the white list (step S211, No), the feature extraction / dimension reduction unit 202 extracts features from the target file, performs dimension reduction as necessary, and generates a feature vector (step S212). The classifier 203 then calculates a score indicating how malware-like the file is from the feature vector (step S213). Finally, the determination unit 204 performs malware determination from the score (step S214).
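The determination flow of steps S211 to S214 can be summarized in a short sketch. The function name `judge`, the use of a set of file identifiers as the white list, and the callbacks `extract_features` and `score` (standing in for the units 202 and 203) are assumptions for this illustration; the description does not say how white list matching is performed.

```python
from typing import Callable, List, Set


def judge(
    file_id: str,
    whitelist: Set[str],
    extract_features: Callable[[str], List[float]],
    score: Callable[[List[float]], float],
    threshold: float,
) -> str:
    """Return "goodware" or "malware" for one target file."""
    if file_id in whitelist:             # S211: white list application
        return "goodware"
    vector = extract_features(file_id)   # S212: feature extraction / reduction
    value = score(vector)                # S213: score from the classifier
    # S214: threshold determination (at or above the threshold means malware)
    return "malware" if value >= threshold else "goodware"
```

Because the white list check comes first, a whitelisted file never reaches the classifier, which is why the white list can absorb goodware that the classifier would otherwise score too highly.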

［第１の実施形態の効果］
調整装置１０は、所定の順序で１つ以上の前処理を実行した後、分類器によってファイルのスコアを算出するスコア算出処理を実行し、算出したスコアに基づいて該ファイルがマルウェアであるか否かの判定処理を行うマルウェア判定装置２０における、各処理を実行するための最適な設定を決定する。 [Effect of the first embodiment]
The adjustment device 10 determines the optimal settings for executing each process in the malware determination device 20, which executes one or more preprocessing steps in a predetermined order, then executes a score calculation process in which a classifier calculates a score for a file, and then determines from the calculated score whether the file is malware.

指示部１０１は、マルウェア判定装置２０の各処理を所定の順序で選択する。教師・検証用データ作成部１０２および検証部１０３は、指示部１０１によって処理が選択されるたびに、選択された処理を実行するための設定の候補を順次適用し、また、指示部１０１によって選択された処理以外の処理のうち最適な設定が決定済みである処理に、最適な設定を適用する。そして、検証部１０３は、設定の適用が行われるたびに、適用された設定にしたがってマルウェア判定装置２０の各処理を実行した場合の、設定の候補のそれぞれに対応する結果を取得する。 The instruction unit 101 selects each process of the malware determination device 20 in a predetermined order. Each time the instruction unit 101 selects a process, the teacher / verification data creation unit 102 and the verification unit 103 sequentially apply the setting candidates for executing the selected process and, for any process other than the selected one whose optimal setting has already been determined, apply that optimal setting. Each time a setting is applied, the verification unit 103 acquires the result, for each setting candidate, of executing each process of the malware determination device 20 according to the applied settings.

分析部１０４は、教師・検証用データ作成部１０２および検証部１０３によって選択された処理の設定の候補のすべてについて対応する結果が取得されるたびに、設定の候補のうち、設定の候補のそれぞれに対応する結果からマルウェアの判定精度が最も高くなると判定された設定の候補を、選択した処理の最適な設定として決定する。 Each time the teacher / verification data creation unit 102 and the verification unit 103 have acquired results for all setting candidates of the selected process, the analysis unit 104 determines, as the optimal setting of the selected process, the setting candidate judged from those results to give the highest malware determination accuracy.

これによって、マルウェア判定装置２０の複数の処理の調整を行う場合であっても、各処理と設定の候補の組み合わせの全てについて検証および分析を行う必要がなく、精度向上手法の調整事項の調整を効率良く行うことができる。 As a result, even when a plurality of processes of the malware determination apparatus 20 are adjusted, it is not necessary to perform verification and analysis for all combinations of each process and setting candidates, and adjustment items of the accuracy improvement method can be adjusted. It can be done efficiently.
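The saving can be made concrete by counting verification runs. If each process has some number of setting candidates, verifying every combination needs the product of those numbers, whereas deciding one process at a time needs only their sum. The function name and the candidate counts in the usage note are made up for illustration.

```python
import math
from typing import Sequence, Tuple


def evaluation_counts(candidates_per_process: Sequence[int]) -> Tuple[int, int]:
    """Return (runs for all combinations, runs when tuning one process at a time)."""
    exhaustive = math.prod(candidates_per_process)  # every combination
    sequential = sum(candidates_per_process)        # one process after another
    return exhaustive, sequential
```

For example, with 5, 8, 4, and 3 candidates for four processes, exhaustive verification needs 480 runs while the sequential approach needs 20.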

また、指示部１０１は、例えばマルウェア判定装置２０において実行される順序と逆の順序で各処理を選択する。また、マルウェア判定装置２０が、ホワイトリストの適用、特徴抽出および次元削減の順で前処理を行い、ファイルがマルウェアであるか否かの判定を閾値によって行う場合、指示部１０１は、スコア算出処理、特徴抽出および次元削減、ホワイトリストの適用および判定処理の順で各処理を指定するようにしてもよい。また、調整装置１０は、各部を用いて教師データの調整を行うようにしてもよい。このように、分類器に密接な調整事項から順に調整を行っていくことで、より効果的な精度向上のための調整を効率的に行うことができる。 The instruction unit 101 selects each process in the reverse order to the order executed in the malware determination device 20, for example. When the malware determination apparatus 20 performs preprocessing in the order of whitelist application, feature extraction, and dimension reduction, and determines whether a file is malware or not using a threshold value, the instruction unit 101 performs score calculation processing. Each process may be specified in the order of feature extraction and dimension reduction, white list application, and determination process. Further, the adjustment device 10 may adjust the teacher data using each unit. In this way, by making adjustments in order starting from adjustment items that are closely related to the classifier, it is possible to efficiently make adjustments for more effective accuracy improvement.
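The ordering rule in this paragraph (classifier first, then the preceding processes in reverse, then the rest) can be written down directly. The stage names and the function name are hypothetical labels for this sketch, and placing the processes that follow the classifier last in their original order is an assumption consistent with the example order given above.

```python
from typing import List


def adjustment_order(pipeline: List[str], classifier_stage: str) -> List[str]:
    """Order adjustment steps: classifier first, earlier stages reversed, later stages last."""
    position = pipeline.index(classifier_stage)
    before = pipeline[:position]      # processes run before the classifier
    after = pipeline[position + 1:]   # processes run after the classifier
    return [classifier_stage] + before[::-1] + after


# Execution order in the malware determination device 20 (names are illustrative).
PIPELINE = [
    "whitelist_application",
    "feature_extraction_and_dimension_reduction",
    "score_calculation",
    "determination",
]
ORDER = adjustment_order(PIPELINE, "score_calculation")
```

Here `ORDER` starts with the score calculation, followed by feature extraction and dimension reduction, white list application, and determination.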

［第２の実施形態］
第１の実施形態においては、マルウェア判定装置が学習処理と判定処理の両方を行う場合について説明した。一方、第２の実施形態においては、マルウェア判定装置に学習機能が備わっていない場合の例について説明する。 [Second Embodiment]
The first embodiment described the case where the malware determination device performs both the learning process and the determination process. The second embodiment, by contrast, describes an example in which the malware determination device has no learning function.

［第２の実施形態の構成］
図８を用いて、第２の実施形態に係る調整装置およびマルウェア判定装置を含んだマルウェア判定システムの構成について説明する。図８は、第２の実施形態に係る調整装置およびマルウェア判定装置を含んだマルウェア判定システムの構成の一例を示す図である。 [Configuration of Second Embodiment]
The configuration of the malware determination system including the adjustment device and the malware determination device according to the second embodiment will be described with reference to FIG. FIG. 8 is a diagram illustrating an example of a configuration of a malware determination system including the adjustment device and the malware determination device according to the second embodiment.

図８に示すように、マルウェア判定装置２０の分類器２０３ａは、第１の実施形態と異なり、学習機能が備わっていない。そのため、マルウェア判定装置２０は、教師データによってマルウェア判定のための学習済みの識別モデルを作成することができない。そこで、調整装置１０の検証部１０３は、マルウェア判定のための識別モデルを機械学習により取得し、取得した識別モデルを調整済設定とともにマルウェア判定装置２０に対して出力する。 As shown in FIG. 8, the classifier 203a of the malware determination device 20 does not have a learning function unlike the first embodiment. Therefore, the malware determination device 20 cannot create a learned identification model for malware determination using teacher data. Therefore, the verification unit 103 of the adjustment device 10 acquires an identification model for malware determination by machine learning, and outputs the acquired identification model to the malware determination device 20 together with the adjusted setting.

[System Configuration, etc.]
Each component of each illustrated device is functionally conceptual and need not be physically configured as illustrated. That is, the specific form of distribution and integration of the devices is not limited to that illustrated; all or part of them may be functionally or physically distributed or integrated in arbitrary units according to various loads, usage conditions, and the like. Furthermore, all or any part of the processing functions performed in each device may be realized by a CPU (Central Processing Unit) and a program analyzed and executed by that CPU, or may be realized as hardware using wired logic.

Of the processes described in the present embodiment, all or part of the processes described as being performed automatically may be performed manually, and all or part of the processes described as being performed manually may be performed automatically by a known method. In addition, the processing procedures, control procedures, specific names, and information including various data and parameters shown in the above description and in the drawings may be changed arbitrarily unless otherwise specified.

[Program]
FIG. 9 is a diagram illustrating an example of a computer that realizes the adjustment device by executing a program. The computer 1000 includes, for example, a memory 1010 and a CPU 1020. The computer 1000 also includes a hard disk drive interface 1030, a disk drive interface 1040, a serial port interface 1050, a video adapter 1060, and a network interface 1070. These units are connected by a bus 1080.

The memory 1010 includes a ROM (Read Only Memory) 1011 and a RAM (Random Access Memory) 1012. The ROM 1011 stores, for example, a boot program such as a BIOS (Basic Input Output System). The hard disk drive interface 1030 is connected to a hard disk drive 1090. The disk drive interface 1040 is connected to a disk drive 1100. A removable storage medium such as a magnetic disk or an optical disk is inserted into the disk drive 1100. The serial port interface 1050 is connected to, for example, a mouse 1110 and a keyboard 1120. The video adapter 1060 is connected to, for example, a display 1130.

The hard disk drive 1090 stores, for example, an OS 1091, an application program 1092, a program module 1093, and program data 1094. That is, the program that defines each process of the adjustment device 10 is implemented as the program module 1093, in which code executable by a computer is written. The program module 1093 is stored in, for example, the hard disk drive 1090. For example, a program module 1093 for executing the same processing as the functional configuration of the adjustment device 10 is stored in the hard disk drive 1090. The hard disk drive 1090 may be replaced by an SSD (Solid State Drive).

The setting data used in the processing of the embodiments described above is stored as the program data 1094 in, for example, the memory 1010 or the hard disk drive 1090. The CPU 1020 then reads the program module 1093 and the program data 1094 stored in the memory 1010 or the hard disk drive 1090 into the RAM 1012 as necessary and executes them.

The program module 1093 and the program data 1094 need not be stored in the hard disk drive 1090; they may instead be stored in, for example, a removable storage medium and read by the CPU 1020 via the disk drive 1100 or the like. Alternatively, the program module 1093 and the program data 1094 may be stored in another computer connected via a network (a LAN (Local Area Network), a WAN (Wide Area Network), or the like) and read by the CPU 1020 from that computer via the network interface 1070.

DESCRIPTION OF SYMBOLS
1 Malware determination system
10 Adjustment device
20 Malware determination device
101 Instruction unit
102 Teacher and verification data creation unit
103 Verification unit
104 Analysis unit
110, 210 Setting storage unit
201 Whitelist application unit
202 Feature extraction and dimension reduction unit
203, 203a Classifier
204 Determination unit

Claims

1. An adjustment device that determines optimum settings for executing each process in a malware determination device, the malware determination device executing one or more pre-processes in a predetermined order, then performing a score calculation process in which a classifier calculates a score of a file, and performing a determination process of determining, based on the calculated score, whether the file is malware, the adjustment device comprising:
an instruction unit that selects each process of the malware determination device in an order reverse to the order in which the processes are executed in the malware determination device;
a setting application unit that, each time a process is selected by the instruction unit, sequentially applies setting candidates for executing the selected process and, for each process other than the selected process whose optimum setting has already been determined, applies that optimum setting;
a verification unit that, each time a setting is applied by the setting application unit, acquires a result corresponding to each of the setting candidates, the result being obtained when each process of the malware determination device is executed according to the settings applied by the setting application unit; and
an analysis unit that, each time the verification unit has acquired the results corresponding to all of the setting candidates of the selected process, determines, as the optimum setting of the selected process, the setting candidate judged from those results to give the highest malware determination accuracy.

2. The adjustment device according to claim 1, wherein the processes selected by the instruction unit in the order reverse to the order executed in the malware determination device are whitelist application, feature extraction, and dimension reduction.

3. The adjustment device according to claim 1 or 2, further comprising a teacher data adjustment unit that adjusts teacher data.

4. The adjustment device according to any one of claims 1 to 3, further comprising a learning unit that learns teacher data and creates an identification model for the score calculation process in the classifier.

5. An adjustment method executed by an adjustment device that determines optimum settings for executing each process in a malware determination device, the malware determination device executing one or more pre-processes in a predetermined order, then performing a score calculation process in which a classifier calculates a score of a file, and performing a determination process of determining, based on the calculated score, whether the file is malware, the adjustment method comprising:
an instruction step of selecting each process of the malware determination device in an order reverse to the order in which the processes are executed in the malware determination device;
a setting application step of, each time a process is selected in the instruction step, sequentially applying setting candidates for executing the selected process and, for each process other than the selected process whose optimum setting has already been determined, applying that optimum setting;
a verification step of, each time a setting is applied in the setting application step, acquiring a result corresponding to each of the setting candidates, the result being obtained when each process of the malware determination device is executed according to the settings applied in the setting application step; and
an analysis step of, each time the results corresponding to all of the setting candidates of the selected process have been acquired in the verification step, determining, as the optimum setting of the selected process, the setting candidate judged from those results to give the highest malware determination accuracy.

6. An adjustment program for causing a computer to function as the adjustment device according to any one of claims 1 to 4.