JP2011086147A

JP2011086147A - Device, method and program for calculating similarity

Info

Publication number: JP2011086147A
Application number: JP2009239014A
Authority: JP
Inventors: Makoto Iwamura (岩村 誠); Mitsuyasu Ito (伊藤 光恭)
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2009-10-16
Filing date: 2009-10-16
Publication date: 2011-04-28
Anticipated expiration: 2029-10-16
Also published as: JP5301411B2

Abstract

PROBLEM TO BE SOLVED: To calculate the similarity of machine language instruction sequences with high accuracy and a small amount of computation.

SOLUTION: A similarity calculation device 10 includes: a contracted instruction sequence generation unit 14c that generates a contracted instruction sequence 13b, which is an array of contracted instructions obtained by removing the operand portion from each machine language instruction in a machine language instruction sequence; a longest common subsequence extraction unit 14d that compares the contracted instruction sequences 13b generated by the contracted instruction sequence generation unit 14c and extracts their longest common subsequence; and a similarity calculation unit 14e that calculates the similarity of the machine language instruction sequences on the basis of the longest common subsequence extracted by the longest common subsequence extraction unit 14d.

COPYRIGHT: (C)2011, JPO&INPIT

Description

The present invention relates to a similarity calculation device, a similarity calculation method, and a similarity calculation program for calculating the similarity of machine language instruction sequences.

In recent years, with the spread of networks such as the Internet, countermeasures against malware such as computer viruses and worms have become increasingly costly. One reason is that malware authors actively develop variants, so the number of malware types grows and analyzing them takes more time.

For this reason, research is being conducted on reducing analysis cost by classifying malware. Malware classification methods fall broadly into two groups: methods that classify based on behavior, and methods that classify based on program code (machine language instruction sequences).

In behavior-based classification, an environment is prepared in which access to system resources such as the file system and the network can be monitored, and information on the behavior of the malware is acquired by actually running the malware in that environment. The malware is then classified by treating the similarity of the acquired behavior information as the similarity of the malware (see Non-Patent Document 1).

In contrast, classification based on program code, unlike the behavior-based approach, can take into account functions that are latent in the malware. Several such methods exist, differing in how they calculate the similarity of program code, as described below.

In one method, a sequence of instruction types is extracted by disassembling the program code, and the similarity of malware is calculated using frequently occurring N-perms (sequences of N instruction types in which order is ignored) as features (see Non-Patent Document 2). By using unordered N-perms rather than ordered N-grams, this method can be expected to mitigate the effect of instruction reordering caused by compiler optimization. In another method, the program code is disassembled, a call tree (a tree representing the calling relationships between functions) is constructed, and the similarity of the tree structures is taken as the similarity of the malware (see Non-Patent Document 3).
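The difference between ordered N-grams and unordered N-perms can be illustrated with a small sketch. The instruction names, the window size of 3, and the helper names `ngrams` and `nperms` are assumptions made for illustration; this is not the feature extraction of Non-Patent Document 2.

```python
from collections import Counter


def ngrams(seq, n):
    """Count ordered windows of n consecutive instruction types."""
    return Counter(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))


def nperms(seq, n):
    """Count the same windows, sorting each one so that order inside it is ignored."""
    return Counter(tuple(sorted(seq[i:i + n])) for i in range(len(seq) - n + 1))


original = ["push", "mov", "add", "call"]
reordered = ["push", "add", "mov", "call"]  # two instructions swapped

# Counter & Counter keeps the minimum count of each key (multiset intersection).
shared_ngrams = sum((ngrams(original, 3) & ngrams(reordered, 3)).values())
shared_nperms = sum((nperms(original, 3) & nperms(reordered, 3)).values())
print(shared_ngrams, shared_nperms)
```

In this toy input the swap removes every shared 3-gram, while both 3-perms still match, which is the robustness to reordering that the paragraph above describes.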

Patent Document 1: JP 2009-193161 A

Non-Patent Document 1: M. Bailey, J. Oberheide, J. Andersen, Z. M. Mao, F. Jahanian, and J. Nazario. "Automated classification and analysis of internet malware". In Proceedings of the 10th Symposium on Recent Advances in Intrusion Detection (RAID'07), pages 178-197, 2007.
Non-Patent Document 2: Karim, M. E., Walenstein, A., Lakhotia, A., and Parida, L. "Malware Phylogeny Generation using Permutations of Code". European Research Journal of Computer Virology 1, 1-2 (Nov. 2005) 13-23.
Non-Patent Document 3: Ero Carrera and Gergely Erdelyi, "Digital Genome Mapping -- Advanced Binary Malware Analysis". Virus Bulletin Conference, September 2004.
Non-Patent Document 4: D. S. Hirschberg, "A linear space algorithm for computing maximal common subsequences", Comm. Assoc. Comput. Mach., 18:6, 341-343, 1975.
Non-Patent Document 5: Maxime Crochemore, Costas S. Iliopoulos, Yoan J. Pinzon: "Speeding-up Hirschberg and Hunt-Szymanski LCS Algorithms". Fundam. Inform. 56(1-2): 89-103 (2003).

However, among the conventional techniques described above, behavior-based classification is easy to implement because the malware itself need not be analyzed, but it has the problem that malware such as bots, which does not operate without commands from an attacker, cannot be classified because its behavior is difficult to observe.

Among the methods that classify based on program code, the method using N-perms has the problem that, when N is small, similarity becomes high even between completely different malware, and when N is large, even a slight difference may greatly affect the similarity. In addition, because statistical information on N-perms is compared, it is difficult to determine precisely which parts of the compared malware match and which do not.

The method using call tree similarity requires a large amount of computation and, because it uses only the calling relationships between functions as the feature of the malware, may yield high similarity even for completely different malware.

The present invention has been made in view of the above, and an object thereof is to provide a similarity calculation device, a similarity calculation method, and a similarity calculation program capable of calculating the similarity of machine language instruction sequences, such as those of malware, with high accuracy and a small amount of computation.

To solve the problems described above and achieve the object, the present invention is a similarity calculation device that calculates the similarity of a plurality of machine language instruction sequences, the device comprising: contracted instruction sequence generation means for generating, for each of the plurality of machine language instruction sequences, a contracted instruction sequence that is an array of contracted instructions obtained by removing the operand portion from each machine language instruction included in the machine language instruction sequence; longest common subsequence extraction means for comparing the contracted instruction sequences generated by the contracted instruction sequence generation means and extracting a longest common subsequence; and similarity calculation means for calculating the similarity of the machine language instruction sequences on the basis of the longest common subsequence extracted by the longest common subsequence extraction means.

In another aspect, the present invention is a similarity calculation method for calculating the similarity of a plurality of machine language instruction sequences, the method comprising: a contracted instruction sequence generation step of generating, for each of the plurality of machine language instruction sequences, a contracted instruction sequence that is an array of contracted instructions obtained by removing the operand portion from each machine language instruction included in the machine language instruction sequence; a longest common subsequence extraction step of comparing the contracted instruction sequences generated in the contracted instruction sequence generation step and extracting a longest common subsequence; and a similarity calculation step of calculating the similarity of the machine language instruction sequences on the basis of the longest common subsequence extracted in the longest common subsequence extraction step.

In yet another aspect, the present invention is a similarity calculation program for calculating the similarity of a plurality of machine language instruction sequences, the program causing a computer to execute: a contracted instruction sequence generation procedure of generating, for each of the plurality of machine language instruction sequences, a contracted instruction sequence that is an array of contracted instructions obtained by removing the operand portion from each machine language instruction included in the machine language instruction sequence; a longest common subsequence extraction procedure of comparing the contracted instruction sequences generated by the contracted instruction sequence generation procedure and extracting a longest common subsequence; and a similarity calculation procedure of calculating the similarity of the machine language instruction sequences on the basis of the longest common subsequence extracted by the longest common subsequence extraction procedure.

The similarity calculation device, similarity calculation method, and similarity calculation program according to the present invention have the effect of being able to calculate the similarity of machine language instruction sequences with high accuracy and a small amount of computation.

FIG. 1 is a block diagram showing the configuration of the similarity calculation device according to the first embodiment.
FIG. 2 is a diagram showing an example of the structure of a contracted instruction.
FIG. 3 is a diagram showing an example of similarity matrix data.
FIG. 4 is a flowchart showing the operation of the similarity calculation device according to the first embodiment.
FIG. 5 is a block diagram showing the configuration of the similarity calculation device according to the second embodiment.
FIG. 6 is a diagram showing an example of difference analysis result data.
FIG. 7 is a flowchart showing the operation of the similarity calculation device according to the second embodiment.
FIG. 8 is a functional block diagram showing a computer that executes the similarity calculation program.

Embodiments of the similarity calculation device, similarity calculation method, and similarity calculation program according to the present invention are described in detail below with reference to the drawings. The following embodiments describe the case where the device, method, and program are used to calculate the similarity of malware, but the present invention is not limited by these embodiments.

First, the configuration of the similarity calculation device 10 according to this embodiment is described. FIG. 1 is a block diagram showing the configuration of the similarity calculation device 10. As shown in FIG. 1, the similarity calculation device 10 includes a display unit 11, an input unit 12, a storage unit 13, and a control unit 14.

The display unit 11 is, for example, a liquid crystal display or an organic EL (Electro-Luminescence) display, and displays various information to the user. The input unit 12 includes, for example, a keyboard and a mouse, and accepts instructions from the user. The display unit 11 and the input unit 12 are not essential components; for example, the similarity calculation device 10 may be configured to accept instructions from the user over a network and to send responses to those instructions back to the user over the network.

The storage unit 13 is, for example, a hard disk drive or a semiconductor memory, and stores various electronic information such as execution modules 13a. An execution module 13a is, for example, malware, and contains the machine language instruction sequence whose similarity is to be calculated, data used by that instruction sequence, and so on. The execution modules 13a are stored in the storage unit 13 via, for example, a network or a storage medium such as a DVD.

The storage unit 13 is also used to store the contracted instruction sequences 13b that the control unit 14 generates as intermediate data and the similarity matrix data 13c that the control unit 14 generates as the processing result.

The control unit 14 controls the similarity calculation device 10 as a whole and includes an unpacking unit 14a, a disassembly unit 14b, a contracted instruction sequence generation unit 14c, a longest common subsequence extraction unit 14d, and a similarity calculation unit 14e.

The unpacking unit 14a applies unpacking to each execution module 13a stored in the storage unit 13 and outputs the result to the disassembly unit 14b. Much malware is subjected to a process called packing, which conceals the original machine language instruction sequence in order to make analysis difficult. When an execution module 13a has been packed, the unpacking unit 14a restores the original machine language instruction sequence using an existing unpacking technique. When an execution module 13a has not been packed, the unpacking unit 14a outputs it to the disassembly unit 14b unchanged.

The disassembly unit 14b disassembles the execution module 13a received from the unpacking unit 14a and outputs the disassembled machine language instruction sequence to the contracted instruction sequence generation unit 14c. As noted above, an execution module 13a contains data to be processed and other content in addition to the machine language instruction sequence, but the disassembly unit 14b outputs only the disassembly result of the machine language instruction sequence to the contracted instruction sequence generation unit 14c. The machine language instruction sequence contained in an execution module 13a can be identified using, for example, the technique disclosed in Patent Document 1.

The contracted instruction sequence generation unit 14c generates a contracted instruction sequence from the machine language instruction sequence received from the disassembly unit 14b. Here, a contracted instruction is a machine language instruction with its operand portion removed, and a contracted instruction sequence is the array of contracted instructions converted from the machine language instructions in a machine language instruction sequence.

For example, when a machine language instruction is a branch instruction, its operand portion contains branch destination information. This information can change when the execution module is modified, as when a malware author creates a variant, for example because new instructions are inserted between the branch source and the branch destination.

Absolute addresses needed for memory access are also specified as operands. When the execution module is implemented as a dynamic link library, the load address is not fixed, and the absolute addresses change depending on when the library is loaded.

Thus, the contents of the operand portion of a machine language instruction can change even when the machine language instruction sequence has not been substantially modified. If machine language instruction sequences including their operand portions were used to calculate similarity, even parts that have not been substantially modified could be judged to differ.

In the similarity calculation method according to this embodiment, therefore, similarity is calculated using contracted instruction sequences, from which the operand portions have been removed. This makes it possible to calculate the similarity of machine language instruction sequences with high accuracy, unaffected by changes in the contents of the operand portions.

The contracted instructions that the contracted instruction sequence generation unit 14c generates from machine language instructions are now described in more detail, taking the IA-32 instruction set as an example. A machine language instruction in the IA-32 instruction set consists of a prefix part that modifies the instruction, an opcode part that indicates the kind of instruction, the ModR/M and SIB bytes that indicate the operand types, an address part used when an operand is in memory, and an immediate part used when an operand is an immediate value.

Prefixes in the IA-32 instruction set are divided into four groups: group 1 has three prefixes, group 2 has six, group 3 has one, and group 4 has one, and at most one prefix is selected from each group. As for the opcode, when the first byte is other than 0x0F, that byte is the effective opcode; when the first byte is 0x0F, the second byte is the effective opcode. Once the opcode is determined, the presence or absence of operands, of ModR/M, and of an immediate value is determined. Furthermore, the value of ModR/M determines whether SIB is present, and the value of SIB determines whether the address part is present.

In this embodiment, a contracted instruction combines the information of the prefix part, the opcode part, ModR/M, and SIB. FIG. 2 shows an example of the structure of a contracted instruction in this embodiment. In the example shown in FIG. 2, a contracted instruction consists of P1 (2 bits), P2 (3 bits), P3, P4, and OL (1 bit each), and OC, M, and S (8 bits each), for a total of 32 bits.

P1 corresponds to the group 1 prefix. Specifically, P1 is 0 when no group 1 prefix is present, and 1, 2, or 3 when the group 1 prefix is F0H, F2H, or F3H, respectively. P2 corresponds to the group 2 prefix. Specifically, P2 is 0 when no group 2 prefix is present, and 1, 2, 3, 4, 5, or 6 when the group 2 prefix is 2EH, 36H, 3EH, 26H, 64H, or 65H, respectively.

P3 corresponds to the group 3 prefix. Specifically, P3 is 0 when no group 3 prefix is present, and 1 when the group 3 prefix is 66H. P4 corresponds to the group 4 prefix. Specifically, P4 is 0 when no group 4 prefix is present, and 1 when the group 4 prefix is 67H.

OL indicates whether the first byte of the opcode is 0x0F: it is 1 if so, and 0 otherwise. OC is the effective opcode value: if the first byte of the opcode is 0x0F, OC is set to the second byte of the opcode; otherwise, it is set to the first byte.

M corresponds to ModR/M: when ModR/M is present, M is set to its value; otherwise, M is set to 0. S corresponds to SIB: when SIB is present, S is set to its value; otherwise, S is set to 0.

Machine language instructions in the IA-32 instruction set vary in length depending on the kind of instruction and other factors, but, as shown in FIG. 2, a contracted instruction in this embodiment has a fixed length of 32 bits regardless of the kind of instruction. This size equals the bit width of the IA-32 general-purpose registers and is well suited to processing contracted instruction sequences efficiently. Forming the contracted instruction sequence as an array of fixed-length elements also makes it easy to implement the common-part extraction that uses the bit-vector algorithm described later.
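The field definitions above can be sketched as a packing routine. This is a minimal illustration under stated assumptions: the `Decoded` container and the `contract` function are hypothetical names, the instruction is assumed to have been decoded already (no x86 decoder is included), and the bit order of the fields (P1 in the top bits, S in the lowest byte) is assumed, since FIG. 2 is not reproduced in this text.

```python
from typing import NamedTuple, Optional, Tuple

# Prefix byte -> field value, following the mapping given in the description.
P1_CODES = {0xF0: 1, 0xF2: 2, 0xF3: 3}
P2_CODES = {0x2E: 1, 0x36: 2, 0x3E: 3, 0x26: 4, 0x64: 5, 0x65: 6}


class Decoded(NamedTuple):
    """One decoded IA-32 instruction (hypothetical container for this sketch)."""
    prefixes: Tuple[int, ...]
    opcode: Tuple[int, ...]        # one byte, or 0x0F followed by a second byte
    modrm: Optional[int] = None    # ModR/M byte, if present
    sib: Optional[int] = None      # SIB byte, if present
    disp: Optional[int] = None     # address part: ignored by contract()
    imm: Optional[int] = None      # immediate part: ignored by contract()


def contract(ins: Decoded) -> int:
    """Pack the prefix, opcode, ModR/M and SIB information into 32 bits.

    Assumed layout, most significant bit first:
    P1(2) P2(3) P3(1) P4(1) OL(1) OC(8) M(8) S(8).
    """
    p1 = p2 = p3 = p4 = 0
    for byte in ins.prefixes:
        if byte in P1_CODES:
            p1 = P1_CODES[byte]
        elif byte in P2_CODES:
            p2 = P2_CODES[byte]
        elif byte == 0x66:
            p3 = 1
        elif byte == 0x67:
            p4 = 1
    ol = 1 if ins.opcode[0] == 0x0F else 0
    oc = ins.opcode[1] if ol else ins.opcode[0]
    m = ins.modrm if ins.modrm is not None else 0
    s = ins.sib if ins.sib is not None else 0
    return ((p1 << 30) | (p2 << 27) | (p3 << 26) | (p4 << 25)
            | (ol << 24) | (oc << 16) | (m << 8) | s)


# mov eax, [ebx+0x10] and mov eax, [ebx+0x20] differ only in the displacement.
a = contract(Decoded((), (0x8B,), modrm=0x43, disp=0x10))
b = contract(Decoded((), (0x8B,), modrm=0x43, disp=0x20))
print(hex(a), a == b)
```

Because the displacement and immediate fields never enter the packed value, the two `mov` instructions map to the same 32-bit element, which is what lets a later comparison treat them as equal.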

The longest common subsequence extraction unit 14d extracts the longest common subsequence (LCS) for every pair of contracted instruction sequences generated by the contracted instruction sequence generation unit 14c. For example, given a contracted instruction sequence {a, b, c, d, e} consisting of five instructions and a contracted instruction sequence {f, b, g, d, h}, also consisting of five instructions, the longest common subsequence of the two is {b, d}. The longest common subsequence can be extracted with a small amount of computation by using, for example, an algorithm based on dynamic programming and divide and conquer that requires O(mn) computation and O(n) memory (see Non-Patent Document 4). An algorithm that applies a technique called bit vectorization to this algorithm, achieving a speedup by a factor equal to the number of bits in the machine's unit of operation, can also be used (see Non-Patent Document 5).
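A linear-space LCS extraction in the spirit of the dynamic programming and divide-and-conquer approach can be sketched as follows. This is a textbook-style illustration rather than the implementation used in the embodiment; the function names `lcs_lengths` and `lcs` are assumptions.

```python
def lcs_lengths(a, b):
    """Return the last row of the LCS length table of a against every prefix of b.

    Only two rows are kept, so memory is proportional to len(b).
    """
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev


def lcs(a, b):
    """Recover one longest common subsequence by divide and conquer."""
    if not a or not b:
        return []
    if len(a) == 1:
        return [a[0]] if a[0] in b else []
    mid = len(a) // 2
    left = lcs_lengths(a[:mid], b)
    right = lcs_lengths(a[mid:][::-1], b[::-1])
    # Split b where the two halves together give the longest result.
    k = max(range(len(b) + 1), key=lambda j: left[j] + right[len(b) - j])
    return lcs(a[:mid], b[:k]) + lcs(a[mid:], b[k:])


common = lcs(list("abcde"), list("fbgdh"))
print(common)
```

With the example sequences from the text, the recovered subsequence is `b, d`. In the embodiment the elements would be 32-bit contracted instructions rather than letters; any values comparable with `==` work here.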

Algorithms that apply bit vectorization are known to have the problem that their memory usage is O(σn), where σ is the alphabet size of the elements being compared, so that bit vectorization becomes difficult to apply when σ is very large. The similarity calculation method according to this embodiment uses as its comparison elements contracted instructions, obtained by removing the operands from machine language instructions. This makes σ smaller than when machine language instructions themselves are compared, so memory usage can be kept low even when bit vectorization is applied.
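The memory behaviour described above can be seen in a sketch of a bit-parallel LCS length computation. The recurrence used here is a widely known bit-vector formulation and is offered as an illustration of the idea; it is not claimed to be the exact algorithm of Non-Patent Document 5. Python integers of arbitrary width stand in for machine words, and the function name is an assumption.

```python
def lcs_length_bitvector(a, b):
    """Length of the longest common subsequence of a and b, computed bit-parallel."""
    # One match mask per distinct symbol of a: bit i is set where a[i] == symbol.
    # This table is the O(sigma * n) memory cost mentioned in the text.
    masks = {}
    for i, x in enumerate(a):
        masks[x] = masks.get(x, 0) | (1 << i)

    full = (1 << len(a)) - 1
    v = full  # every zero bit in v marks one unit of LCS length found so far
    for y in b:
        m = masks.get(y, 0)
        u = v & m
        v = ((v + u) | (v & ~m)) & full
    return len(a) - bin(v).count("1")


print(lcs_length_bitvector("abcde", "fbgdh"))
```

Each step of the loop updates a whole row of the dynamic programming table with a few word-wide operations, which is where the speedup proportional to the word size comes from. The `masks` table grows with the number of distinct symbols, so a smaller alphabet, such as contracted instructions instead of full instructions, keeps it small.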

The similarity calculation unit 14e calculates the similarity for every pair of execution modules 13a (machine language instruction sequences) on the basis of the length of the longest common subsequence extracted by the longest common subsequence extraction unit 14d, and outputs the results as similarity matrix data 13c.

Let A and B be two machine language instruction sequences, C_A and C_B the corresponding contracted instruction sequences, and L(C_A) and L(C_B) the lengths of those contracted instruction sequences. The longest common subsequence LCS(C_A, C_B) of C_A and C_B is extracted by the longest common subsequence extraction unit 14d. Letting L_LCS(C_A, C_B) denote the length of LCS(C_A, C_B), the similarity calculation unit 14e calculates the degree of similarity between machine language instruction sequences A and B by computing, using Equation (1), the Jaccard coefficient, which takes a value from 0 to 1 and represents the proportion of similarity.
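Equation (1) itself does not appear in this text. Assuming the usual definition of the Jaccard coefficient (size of the common part divided by size of the union), with the LCS length playing the role of the common part, it would take the following form. This is a reconstruction from the surrounding definitions, not a quotation of the original equation.

```latex
\[
  J(A, B) \;=\; \frac{L_{LCS}(C_A, C_B)}{L(C_A) + L(C_B) - L_{LCS}(C_A, C_B)}
  \tag{1}
\]
```

Under this form, two identical contracted instruction sequences give a value of 1, and two sequences with no common element give 0, which agrees with the stated range of 0 to 1. For the earlier example with two five-element sequences sharing {b, d}, the value would be 2 / (5 + 5 - 2) = 0.25.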

The method using the Jaccard coefficient is one example of calculating similarity on the basis of the length of the longest common subsequence; any other method may be used as long as it calculates similarity on the basis of that length. For example, the length of the longest common subsequence itself may be used as the similarity measure.

FIG. 3 shows an example of the similarity matrix data 13c output by the similarity calculation unit 14e. The example in FIG. 3 is the similarity matrix data 13c obtained when similarity is calculated for four machine language instruction sequences A to D; the similarity of every pair among A to D is calculated as a Jaccard coefficient.
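Building a similarity matrix over all pairs can be sketched as follows. The Jaccard form used here (LCS length divided by the sum of both lengths minus the LCS length) is an assumption based on the standard definition of the coefficient, the input values are made up for illustration and are not the data of FIG. 3, and the function names are hypothetical.

```python
from itertools import combinations


def lcs_length(a, b):
    """LCS length by two-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def jaccard(ca, cb):
    """Assumed Jaccard form: common length over the length of the union."""
    common = lcs_length(ca, cb)
    union = len(ca) + len(cb) - common
    return common / union if union else 1.0


def similarity_matrix(sequences):
    """Return {name: {name: similarity}} for every pair of named sequences."""
    names = list(sequences)
    matrix = {n: {n: 1.0} for n in names}
    for p, q in combinations(names, 2):
        matrix[p][q] = matrix[q][p] = jaccard(sequences[p], sequences[q])
    return matrix


# Integers stand in for 32-bit contracted instructions.
matrix = similarity_matrix({
    "A": [1, 2, 3, 4, 5],
    "B": [6, 2, 7, 4, 8],
    "C": [1, 2, 3, 4, 5],
})
print(matrix["A"]["B"], matrix["A"]["C"])
```

The matrix is symmetric, so each pair is computed once and stored in both positions.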

The machine language instruction sequences may further be clustered based on the similarity matrix data 13c output in this way. If the machine language instruction sequences being compared are malware, for example, similarity-based clustering makes it easier to track efficiently which malware is spreading and which is dying out, and, when unknown malware appears, to identify the known malware most similar to it.
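The text does not name a clustering method. As one minimal sketch, assuming a simple threshold rule (two sequences fall in the same cluster when a chain of pairs with similarity at or above the threshold connects them), the grouping could look like the following. The function name, the threshold of 0.8, and the similarity values are all made up for illustration.

```python
from typing import Dict, List, Tuple


def cluster_by_threshold(
    names: List[str],
    similarity: Dict[Tuple[str, str], float],
    threshold: float = 0.8,
) -> List[List[str]]:
    """Group names whose pairwise similarity reaches the threshold (union-find)."""
    parent = {name: name for name in names}

    def find(name: str) -> str:
        # Follow parent links to the cluster representative, shortening the path.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if similarity[(a, b)] >= threshold:
                parent[find(a)] = find(b)

    clusters: Dict[str, List[str]] = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return sorted(clusters.values())


# Made-up similarity values for four sequences A to D.
names = ["A", "B", "C", "D"]
scores = {
    ("A", "B"): 0.90, ("A", "C"): 0.20, ("A", "D"): 0.10,
    ("B", "C"): 0.30, ("B", "D"): 0.15, ("C", "D"): 0.85,
}
clusters = cluster_by_threshold(names, scores)
```

Finding the known sample closest to a new one needs no clustering at all: it is the entry with the largest similarity in that sample's row of the matrix.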

The similarity calculation device 10 may be configured so that the similarity matrix data 13c can be output to the display unit 11 or to a printing device (not shown) in a tabular, graph, or similar format, or so that the similarity matrix data 13c can be transferred to another device via a network or a storage medium.

Next, the operation of the similarity calculation device 10 shown in FIG. 1 is described with reference to the flowchart shown in FIG. 4. Here, all execution modules 13a whose similarity is to be calculated are assumed to be stored in the storage unit 13 in advance.

As shown in FIG. 4, the unpacking unit 14a selects one not-yet-selected execution module 13a from those stored in the storage unit 13 (step S101). If a module could be selected (No at step S102) and the selected execution module 13a is packed (Yes at step S103), the unpacking unit 14a unpacks it and outputs the result to the disassembly unit 14b (step S104). If the selected execution module 13a is not packed (No at step S103), the unpacking unit 14a outputs it unchanged to the disassembly unit 14b.

The disassembly unit 14b disassembles the input execution module 13a and extracts a machine language instruction sequence (step S105). The contracted instruction sequence generation unit 14c then generates a contracted instruction sequence 13b from the extracted machine language instruction sequence (step S106). Once the contracted instruction sequence 13b corresponding to the execution module 13a selected in step S101 has been generated, processing resumes from step S101, and the unpacking unit 14a again attempts to select a not-yet-selected execution module 13a from those stored in the storage unit 13.

When all execution modules 13a have already been selected in step S101 (Yes at step S102), the longest common subsequence extraction unit 14d selects one not-yet-selected combination of contracted instruction sequences 13b (step S107). If a combination could be selected (No at step S108), the longest common subsequence extraction unit 14d extracts the longest common subsequence from the selected combination of contracted instruction sequences 13b (step S109). Based on the extracted longest common subsequence, the similarity calculation unit 14e then calculates a similarity score representing the similarity of the pair of machine language instruction sequences that corresponds to the combination of contracted instruction sequences 13b selected in step S107 (step S110).

Once the similarity score for the pair of machine language instruction sequences corresponding to the combination selected in step S107 has been calculated, processing resumes from step S107, and the longest common subsequence extraction unit 14d again attempts to select a not-yet-selected combination of contracted instruction sequences 13b. When all combinations have already been selected in step S107 (Yes at step S108), the similarity calculation unit 14e generates the similarity matrix data 13c from the similarity scores calculated so far (step S111), and the series of processing ends.
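The flow of steps S101 to S111 can be sketched as two loops: one that turns every execution module into a contracted instruction sequence, and one that scores every pair. In the sketch below, unpacking, disassembly, and contraction are passed in as hypothetical callables, because their real implementations are outside the scope of this illustration; the toy "modules" are plain assembly text, contraction keeps only the mnemonic, and the similarity uses the Jaccard form assumed for equation (1).

```python
from itertools import combinations
from typing import Callable, Dict, List, Tuple


def lcs_length(ca: List[str], cb: List[str]) -> int:
    """Length of the longest common subsequence (row-by-row dynamic programming)."""
    prev = [0] * (len(cb) + 1)
    for x in ca:
        curr = [0]
        for j, y in enumerate(cb, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def similarity_matrix_from_modules(
    modules: Dict[str, str],
    is_packed: Callable[[str], bool],
    unpack: Callable[[str], str],
    disassemble: Callable[[str], List[str]],
    contract: Callable[[List[str]], List[str]],
) -> Dict[Tuple[str, str], float]:
    contracted: Dict[str, List[str]] = {}
    for name, module in modules.items():            # S101, S102
        if is_packed(module):                       # S103
            module = unpack(module)                 # S104
        instructions = disassemble(module)          # S105
        contracted[name] = contract(instructions)   # S106

    matrix: Dict[Tuple[str, str], float] = {}
    for a, b in combinations(contracted, 2):        # S107, S108
        llcs = lcs_length(contracted[a], contracted[b])         # S109
        union = len(contracted[a]) + len(contracted[b]) - llcs
        matrix[(a, b)] = llcs / union if union else 1.0         # S110
    return matrix                                   # S111


# Toy stand-ins: each "module" is already plain assembly text.
modules = {
    "A": "push ebp\nmov ebp, esp\ncall 0x401000\nret",
    "B": "push ebp\nmov ebp, esp\ncall 0x402000\npop ebp\nret",
}
matrix = similarity_matrix_from_modules(
    modules,
    is_packed=lambda module: False,
    unpack=lambda module: module,
    disassemble=lambda module: module.splitlines(),
    contract=lambda instructions: [ins.split()[0] for ins in instructions],
)
```

In this toy input the two call instructions differ only in their operands, so they count as the same contracted instruction and the pair scores 4 / (4 + 5 - 4) = 0.8.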

As described above, this embodiment calculates the similarity of machine language instruction sequences from contracted instruction sequences, which are obtained by removing the operand part of each machine language instruction. The similarity of machine language instruction sequences can therefore be calculated with high accuracy and a small amount of computation.

In Embodiment 1, a contracted instruction sequence is generated from each machine language instruction sequence, and the similarity of the machine language instruction sequences is calculated based on the length of the longest common subsequence extracted from the contracted instruction sequences. However, this is not the only way to analyze the similarity of machine language instruction sequences using contracted instruction sequences. Embodiment 2 therefore describes an example of another such method. In the following description, parts identical to those already described are given the same reference numerals, and duplicate description is omitted.

First, the configuration of the similarity calculation device 20 according to this embodiment is described. FIG. 5 is a block diagram showing the configuration of the similarity calculation device 20. As shown in FIG. 5, the similarity calculation device 20 includes a display unit 11, an input unit 12, a storage unit 23, and a control unit 24.

The storage unit 23 differs from the storage unit 13 shown in FIG. 1 in that it stores the difference analysis result data 23c generated by the control unit 24 as its processing result, rather than the similarity matrix data 13c generated by the control unit 14.

The control unit 24 differs from the control unit 14 shown in FIG. 1 in that it has a similarity calculation unit 24e in place of the similarity calculation unit 14e. The similarity calculation unit 24e compares each longest common subsequence extracted by the longest common subsequence extraction unit 14d with each of the two contracted instruction sequences from which it was extracted, extracts the instructions unique to each contracted instruction sequence, and generates the difference analysis result data 23c. Extracting the instructions unique to each contracted instruction sequence makes it easier, for example, to analyze modified malware by focusing on the modified portions.

Let A and B be two machine language instruction sequences, CA and CB the corresponding contracted instruction sequences, L(CA) and L(CB) their lengths, and LCS(CA, CB) the longest common subsequence of CA and CB extracted by the longest common subsequence extraction unit 14d. The similarity calculation unit 24e compares LCS(CA, CB) with CA in order from the head and thereby identifies the contracted instructions that are present in CA but not in LCS(CA, CB) and those that are present in both CA and LCS(CA, CB). The former are the instructions unique to CA. Likewise, the similarity calculation unit 24e compares LCS(CA, CB) with CB in order from the head and identifies the contracted instructions that are present in CB but not in LCS(CA, CB) and those that are present in both CB and LCS(CA, CB). The former are the instructions unique to CB.
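The head-to-tail comparison described above can be sketched as a single pass with one cursor into LCS(CA, CB): an instruction that matches the element under the cursor is common, and any other instruction is unique to the sequence being walked. The function names and the mnemonic-only contracted instructions are assumptions for illustration, and the table-based LCS is only one possible way to obtain the subsequence.

```python
from typing import List, Tuple


def lcs(ca: List[str], cb: List[str]) -> List[str]:
    """Return one longest common subsequence of ca and cb."""
    n, m = len(ca), len(cb)
    # table[i][j] = LCS length of the suffixes ca[i:] and cb[j:].
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if ca[i] == cb[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    result, i, j = [], 0, 0
    while i < n and j < m:
        if ca[i] == cb[j]:
            result.append(ca[i])
            i, j = i + 1, j + 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return result


def split_against_lcs(
    seq: List[str], common: List[str]
) -> Tuple[List[Tuple[int, str]], List[Tuple[int, str]]]:
    """Walk seq from the head; return (unique, shared) as (line number, instruction) pairs."""
    unique, shared, cursor = [], [], 0
    for line_no, instruction in enumerate(seq, start=1):
        if cursor < len(common) and instruction == common[cursor]:
            shared.append((line_no, instruction))
            cursor += 1
        else:
            unique.append((line_no, instruction))
    return unique, shared


# Hypothetical contracted instruction sequences (mnemonics only).
CA = ["push", "mov", "sub", "call", "ret"]
CB = ["push", "mov", "call", "pop", "ret"]
common = lcs(CA, CB)
unique_a, shared_a = split_against_lcs(CA, common)
unique_b, shared_b = split_against_lcs(CB, common)
```

Because LCS(CA, CB) is a subsequence of both inputs, the single cursor always reaches its end, so every instruction is classified exactly once.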

FIG. 6 shows an example of the difference analysis result data 23c output by the similarity calculation unit 24e. The example in FIG. 6 is the difference analysis result data 23c obtained when the differences between machine language instruction sequence A and machine language instruction sequence B, out of a plurality of machine language instruction sequences, are extracted, and it is written in XML format. In the example in FIG. 6, in addition to the differences between machine language instruction sequences A and B, the instructions included in the longest common subsequence are output as the part common to A and B.

In the example in FIG. 6, the tag "machine language instruction sequence 1" indicates that one of the two sequences being compared is machine language instruction sequence A, and the tag "machine language instruction sequence 2" indicates that the other is machine language instruction sequence B. The tag "unique contracted instruction 1" contains the contracted instructions unique to machine language instruction sequence A, the tag "unique contracted instruction 2" contains those unique to machine language instruction sequence B, and the tag "common contracted instruction" contains the contracted instructions common to machine language instruction sequences A and B. Each of the tags "unique contracted instruction 1", "unique contracted instruction 2", and "common contracted instruction" contains zero or more "contracted instruction" tags, each holding the line number of the contracted instruction within its contracted instruction sequence and the values of the fields of that contracted instruction.
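FIG. 6 is not reproduced in this text, so the exact XML layout is unknown. The following sketch only illustrates the nesting described above, using Python's standard xml.etree.ElementTree. The English element names (diff_result, sequence1, unique1, and so on), the line attribute, and the choice to give common instructions their line numbers in sequence A are all assumptions.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

Entry = Tuple[int, str]  # (line number in the contracted sequence, contracted instruction)


def build_diff_xml(
    name_a: str,
    name_b: str,
    unique_a: List[Entry],
    unique_b: List[Entry],
    common: List[Entry],
) -> str:
    """Serialize a difference analysis result with hypothetical element names."""
    root = ET.Element("diff_result")
    ET.SubElement(root, "sequence1").text = name_a
    ET.SubElement(root, "sequence2").text = name_b
    groups = (("unique1", unique_a), ("unique2", unique_b), ("common", common))
    for tag, entries in groups:
        group = ET.SubElement(root, tag)
        # Each group holds zero or more instruction elements.
        for line_no, instruction in entries:
            child = ET.SubElement(group, "instruction", line=str(line_no))
            child.text = instruction
    return ET.tostring(root, encoding="unicode")


xml_text = build_diff_xml(
    "A",
    "B",
    unique_a=[(3, "sub")],
    unique_b=[],
    common=[(1, "push"), (2, "mov")],
)
```

A group with no entries is still emitted as an empty element, which corresponds to the "zero or more" wording above.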

In the example in FIG. 6, the contents of contracted instructions are output as the differences. Alternatively, the contents of the original machine language instructions may be output as the differences, either by associating each contracted instruction in a contracted instruction sequence with its source machine language instruction in advance by some method, or by dynamically identifying the source machine language instruction from the line number of the contracted instruction in the contracted instruction sequence.
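The line-number approach works when each contracted instruction keeps the position of the machine language instruction it was derived from. A minimal sketch under that assumption follows; the textual instructions, the mnemonic-only contraction, and the function names are hypothetical.

```python
from typing import List, Tuple


def contract(instructions: List[str]) -> List[str]:
    """Hypothetical contraction: keep only the mnemonic, preserving order and count."""
    return [instruction.split()[0] for instruction in instructions]


def source_instructions(
    instructions: List[str], line_numbers: List[int]
) -> List[Tuple[int, str]]:
    """Map 1-based line numbers of contracted instructions back to their source instructions."""
    return [(n, instructions[n - 1]) for n in line_numbers]


machine_code = [
    "push ebp",
    "mov ebp, esp",
    "sub esp, 0x10",
    "call 0x401000",
    "ret",
]
contracted = contract(machine_code)
# Suppose line 3 of the contracted sequence was reported as unique.
restored = source_instructions(machine_code, [3])
```

Because contraction here produces exactly one contracted instruction per machine language instruction, the line number alone is enough to recover the operands that contraction removed.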

The similarity calculation device 20 may likewise be configured so that the difference analysis result data 23c can be output to the display unit 11 or to a printing device (not shown) in a tabular or graphical format, or so that the difference analysis result data 23c can be transferred to another device via a network or a storage medium.

Next, the operation of the similarity calculation device 20 shown in FIG. 5 is described with reference to the flowchart shown in FIG. 7. Steps S201 to S206 are the same as the corresponding steps in the flowchart shown in FIG. 4, so their description is omitted.

When all execution modules 13a have already been selected in step S201 (Yes at step S202), the longest common subsequence extraction unit 14d selects one not-yet-selected combination of contracted instruction sequences 13b (step S207). If a combination could be selected (No at step S208), the longest common subsequence extraction unit 14d extracts the longest common subsequence from the selected combination of contracted instruction sequences 13b (step S209). Based on the extracted longest common subsequence, the similarity calculation unit 24e then generates the difference analysis result data 23c for the combination of contracted instruction sequences 13b selected in step S207 (step S210).

Once the difference analysis result data 23c for the combination of contracted instruction sequences 13b selected in step S207 has been generated, processing resumes from step S207, and the longest common subsequence extraction unit 14d again attempts to select a not-yet-selected combination of contracted instruction sequences 13b. When all combinations have already been selected in step S207 (Yes at step S208), the series of processing ends.

As described above, using contracted instruction sequences makes it possible to analyze the similarity of machine language instruction sequences in a variety of ways, beyond calculating similarity from the length of the longest common subsequence extracted from the contracted instruction sequences.

The embodiments described above can be modified in various ways without departing from the gist of the invention. For example, the embodiments can be combined as appropriate. The functions of the control unit 14 of the similarity calculation device 10 shown in FIG. 1 and of the control unit 24 of the similarity calculation device 20 shown in FIG. 5 can also be implemented as software and executed on a computer to provide functions equivalent to those of the similarity calculation device 10 and the similarity calculation device 20. An example of a computer that executes a similarity calculation program 171, in which the functions of the control unit 14 of the similarity calculation device 10 are implemented as software, is described below.

FIG. 8 is a functional block diagram showing a computer 100 that executes the similarity calculation program 171. The computer 100 is configured by connecting the following via a bus: a CPU (Central Processing Unit) 110 that executes various arithmetic processes, an input device 120 that accepts data input from a user, a monitor 130 that displays various information, a medium reading device 140 that reads programs and the like from a recording medium, a network interface device 150 that exchanges data with other computers via a network, a RAM (Random Access Memory) 160 that temporarily stores various information, and a hard disk device 170.

The hard disk device 170 stores the similarity calculation program 171, which has the same functions as the control unit 14 shown in FIG. 1, and machine language instruction sequences 172 corresponding to the execution modules 13a stored in the storage unit 13 shown in FIG. 1. The machine language instruction sequences 172 may instead be stored on another computer connected via a network, in a manner that allows the computer 100 to access them.

When the CPU 110 reads the similarity calculation program 171 from the hard disk device 170 and loads it into the RAM 160, the similarity calculation program 171 functions as a similarity calculation process 161. The similarity calculation process 161 loads the machine language instruction sequences 172 and other data, as needed, into the area of the RAM 160 allocated to it, executes various data processing based on the loaded data, and stores calculation result data 173, which corresponds to the similarity matrix data 13c shown in FIG. 1, in the hard disk device 170 or the like.

The similarity calculation program 171 need not be stored in the hard disk device 170; the computer 100 may read the program from a storage medium such as a CD-ROM and execute it. Alternatively, the program may be stored on another computer (or server) connected to the computer 100 via a public line, the Internet, a LAN (Local Area Network), a WAN (Wide Area Network), or the like, and the computer 100 may read the program from there and execute it.

The similarity calculation device, similarity calculation method, and similarity calculation program according to the present invention can be used not only to calculate the similarity of machine language instruction sequences that have been modified with malicious intent, as with malware, but also for a variety of other purposes, such as calculating the similarity of machine language instruction sequences modified to add functionality or fix defects, or of machine language instruction sequences suspected of being built from stolen source code.

DESCRIPTION OF SYMBOLS
10, 20 Similarity calculation device
11 Display unit
12 Input unit
13, 23 Storage unit
13a Execution module
13b Contracted instruction sequence
13c Similarity matrix data
14, 24 Control unit
14a Unpacking unit
14b Disassembly unit
14c Contracted instruction sequence generation unit
14d Longest common subsequence extraction unit
14e, 24e Similarity calculation unit
23c Difference analysis result data
100 Computer
110 CPU
120 Input device
130 Monitor
140 Medium reading device
150 Network interface device
160 RAM
161 Similarity calculation process
170 Hard disk device
171 Similarity calculation program
172 Machine language instruction sequence
173 Calculation result data

Claims

1. A similarity calculation device that calculates the similarity of a plurality of machine language instruction sequences, comprising:
contracted instruction sequence generation means for generating, for each of the plurality of machine language instruction sequences, a contracted instruction sequence that is an array of contracted instructions obtained by removing the operand part from each machine language instruction included in the machine language instruction sequence;
longest common subsequence extraction means for comparing the contracted instruction sequences generated by the contracted instruction sequence generation means with each other and extracting a longest common subsequence; and
similarity calculation means for calculating the similarity of the machine language instruction sequences based on the longest common subsequence extracted by the longest common subsequence extraction means.

2. The similarity calculation device according to claim 1, wherein the longest common subsequence extraction means extracts the longest common subsequence from the contracted instruction sequences based on a bit-vector algorithm.

3. The similarity calculation device according to claim 1, wherein the contracted instruction sequence generation means generates a contracted instruction sequence that is an array of contracted instructions each having a predetermined bit length.

4. The similarity calculation device according to claim 1, wherein the similarity calculation means calculates the similarity of the machine language instruction sequences based on the length of the longest common subsequence.

5. The similarity calculation device according to claim 1, wherein the similarity calculation means extracts differences between the machine language instruction sequences based on the longest common subsequence.

6. A similarity calculation method for calculating the similarity of a plurality of machine language instruction sequences, comprising:
a contracted instruction sequence generation step of generating, for each of the plurality of machine language instruction sequences, a contracted instruction sequence that is an array of contracted instructions obtained by removing the operand part from each machine language instruction included in the machine language instruction sequence;
a longest common subsequence extraction step of comparing the contracted instruction sequences generated in the contracted instruction sequence generation step with each other and extracting a longest common subsequence; and
a similarity calculation step of calculating the similarity of the machine language instruction sequences based on the longest common subsequence extracted in the longest common subsequence extraction step.

7. A similarity calculation program for calculating the similarity of a plurality of machine language instruction sequences, the program causing a computer to execute:
a contracted instruction sequence generation procedure of generating, for each of the plurality of machine language instruction sequences, a contracted instruction sequence that is an array of contracted instructions obtained by removing the operand part from each machine language instruction included in the machine language instruction sequence;
a longest common subsequence extraction procedure of comparing the contracted instruction sequences generated by the contracted instruction sequence generation procedure with each other and extracting a longest common subsequence; and
a similarity calculation procedure of calculating the similarity of the machine language instruction sequences based on the longest common subsequence extracted by the longest common subsequence extraction procedure.