TWI358639B

TWI358639B - Malware detection system, data mining module, malw

Info

Publication number: TWI358639B
Application number: TW96138249A
Authority: TW
Inventors: Shi Jinn Houng; Kun Asien Hsiao
Original assignee: Univ Nat Taiwan Science Tech
Priority date: 2007-10-12
Filing date: 2007-10-12
Publication date: 2012-02-21
Also published as: TW200917020A

Description


Sanda reference number: TW3715PA

IX. Description of the Invention

[Technical Field]

The present invention relates to a malware detection system, a data mining module, and a malware detection module, and in particular to a malware detection system, a data mining module, and a malware detection module based on data mining techniques.

[Prior Art]

In recent years, malware has evolved very rapidly. A conventional antivirus system extracts a corresponding pattern from each known malicious program and stores it in its database. Every program, malicious or not, corresponds to a unique pattern. To examine a program under test, the conventional antivirus system compares the pattern of that program with the patterns stored in the database. When the pattern of the program under test is identical to one of the patterns in the database, the conventional antivirus system identifies the program under test as a known malicious program.

However, malware often evolves into different variants very quickly. A variant behaves much like the malicious program it derives from, yet the patterns of the two still differ. For example, when a known malicious program A evolves into a new variant A', the conventional antivirus system still fails to detect A', even though A' behaves similarly to A and the system already holds the pattern of A. A conventional antivirus system can therefore detect only known malicious programs; it cannot detect new variants that evolved from known malicious programs and behave similarly to them.

Conventional antivirus systems therefore cannot cope with the growing number of malware variants. When a new variant appears, it has already harmed the client computer before the conventional antivirus system obtains the pattern of that variant.

[Summary of the Invention]

The present invention relates to a malware detection system. Using only the features of known malicious programs and known non-malicious programs, the malware detection system of the present invention can detect variants that are of the same type as known malicious programs but have never appeared before.

According to a first aspect of the present invention, a data mining module is provided for outputting a classification model based on a plurality of known malicious programs (malware) and a plurality of known non-malicious programs. A program under test is classified as either a malicious program or a non-malicious program according to the classification model. The data mining module includes a program database, a feature mining unit, a feature screening unit, and a classification model training unit. The program database stores the known malicious programs and the known non-malicious programs. The feature mining unit extracts N candidate features (features to be screened) from the known malicious programs and the known non-malicious programs. The i-th candidate feature is an interaction between a file system and at least one of the known malicious programs and the known non-malicious programs, and at least one of the known malicious programs and the known non-malicious programs has the i-th candidate feature, where i is a positive integer less than or equal to N. The feature screening unit selects a plurality of effective features from the N candidate features. Each effective feature is substantially mainly a feature of one of the two groups, the known malicious programs or the known non-malicious programs. The classification model training unit trains the classification model according to the effective features.

[...] The malware detection module includes a feature analysis unit, a feature screening unit, and a classifier. The feature analysis unit extracts preliminary features from the program under test. Each preliminary feature is an interaction between the program under test and the file system. The feature screening unit selects reference features from the preliminary features according to the effective features. The classifier classifies the program under test as either a malicious program or a non-malicious program according to the reference features [...]

According to a fourth aspect of the present invention, a data mining method is provided for outputting a classification model based on a plurality of known malicious programs and a plurality of known non-malicious programs. A program under test is classified as either a malicious program or a non-malicious program according to the classification model. The data mining method includes the following steps. First, N candidate features are extracted from the known malicious programs and the known non-malicious programs. The i-th candidate feature is an interaction between a file system and at least one of the known malicious programs and the known non-malicious programs, where i is a positive integer less than or equal to N. Next, a plurality of effective features are selected from the N candidate features [...] Then, the classification model is trained according to the effective features.

According to a further aspect of the present invention, a malware detection method is provided for detecting a program under test [...] The malware detection method includes the following steps. First, preliminary features are extracted from the program under test, each preliminary feature being an interaction between the program under test and [...]

[...] extracted from at least one of the programs. Each of the candidate features F1 to FN is an interaction between a file system and at least one of the known malicious programs Pm and the known non-malicious programs Pb. For example, when several of the known malicious programs and several of the known non-malicious programs have a given candidate feature, those known malicious programs Pm and those known non-malicious programs Pb all exhibit the same interaction with the file system.

Because malicious programs and non-malicious programs use dynamic link libraries (DLLs) differently, in this embodiment the feature mining unit 112 extracts, as the candidate features F1 to FN, the paths of the DLLs used by each known malicious program Pm and each known non-malicious program Pb, together with the application program interfaces (APIs) used by each program.

In this embodiment, the candidate features that the feature mining unit 112 extracts from one program, whether a known malicious program or a known non-malicious program, fall into four kinds. The first kind is the first-layer DLLs that the program uses directly. The second kind is the paths from the first-layer DLLs used by the program down to the last-layer DLLs. The third kind is the APIs in the first-layer DLLs that are used by the program. The fourth kind is the APIs in the first-layer DLLs that are used by other DLLs.

Take as an example extracting, as candidate features, the interactions between a program Filemon.exe and the file system of the Windows operating system. The first-layer DLLs used by Filemon.exe include COMCTL32.DLL, KERNEL32.DLL, and USER32.DLL. The feature mining unit 112 therefore extracts these first-layer DLLs as candidate features of Filemon.exe.

The first-layer DLLs may in turn use second-layer DLLs, the second-layer DLLs may use third-layer DLLs, and so on down to the last-layer DLLs. The feature mining unit 112 extracts the path from each first-layer DLL to each last-layer DLL as a candidate feature of the program. For example, USER32.DLL in the first layer uses the second-layer DLLs GDI32.DLL, KERNEL32.DLL, and MSIMG32.DLL, and KERNEL32.DLL in the second layer uses the last-layer DLL NTDLL.DLL. The feature mining unit 112 therefore extracts the path formed by USER32.DLL in the first layer, KERNEL32.DLL in the second layer, and NTDLL.DLL in the last layer as a candidate feature of Filemon.exe. This example follows the DLLs used by USER32.DLL; the DLL paths of the other first-layer DLLs, such as COMCTL32.DLL, are extracted in the same way.

The feature mining unit 112 also extracts the APIs in the first-layer DLLs that Filemon.exe uses, such as RtlFreeHeap, RtlAllocateHeap, and RtlGetLastWin32Error, as candidate features of Filemon.exe. In addition, it extracts the APIs in the first-layer DLLs that are used by other DLLs, such as CsrAllocateCaptureBuffer, CsrAllocateMessagePointer, and RtlSizeHeap, as candidate features of Filemon.exe.
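The second kind of candidate feature, the path from a first-layer DLL down to a last-layer DLL, can be illustrated with a short sketch. This is only a minimal illustration, not the patented implementation: the function name `dll_paths` and the dependency map `uses` are hypothetical, and the map contains only the dependencies quoted in the text, so any DLL without an entry is treated as a last-layer DLL.

```python
def dll_paths(first_layer_dll, uses):
    """Return every path from a first-layer DLL down to a last-layer DLL.

    `uses` maps a DLL name to the DLLs it uses. A DLL with no entry
    (or an empty list) is treated as a last-layer DLL.
    """
    paths = []

    def walk(dll, path):
        # Ignore DLLs already on the current path so cycles cannot loop forever.
        children = [c for c in uses.get(dll, []) if c not in path]
        if not children:
            paths.append(tuple(path))
            return
        for child in children:
            walk(child, path + [child])

    walk(first_layer_dll, [first_layer_dll])
    return paths


# Hypothetical, deliberately incomplete dependency map built from the
# DLL names quoted in the text.
uses = {
    "USER32.DLL": ["GDI32.DLL", "KERNEL32.DLL", "MSIMG32.DLL"],
    "KERNEL32.DLL": ["NTDLL.DLL"],
}

paths = dll_paths("USER32.DLL", uses)
for path in paths:
    print(" -> ".join(path))
```

Each returned tuple would then be recorded as one candidate feature of the program. How the dependency map itself is obtained from an executable is not covered by this sketch.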

After the feature mining unit 112 has extracted the candidate features F1 to FN from each known malicious program Pm and each known non-malicious program Pb, the first feature screening unit 113 selects a plurality of effective features Fe from the candidate features F1 to FN.

The operation of the first feature screening unit 113 is detailed below. In this embodiment, the candidate features are numerous, and many of them may belong to both known malicious programs and known non-malicious programs. The first feature screening unit 113 therefore decides, one by one, whether each of the candidate features F1 to FN is an effective feature. An effective feature Fe is substantially mainly a feature of one of the two groups, the known malicious programs or the known non-malicious programs. In other words, exactly one of two cases holds for an effective feature Fe: either it is substantially mainly a feature of known malicious programs, or it is substantially mainly a feature of known non-malicious programs.

For example, suppose that among 1000 known malicious programs Pm and 1050 known non-malicious programs Pb, 300 known malicious programs Pm have candidate feature F1 and 320 known non-malicious programs Pb also have candidate feature F1.

The probability that a known malicious program has candidate feature F1 is then about the same as the probability that a known non-malicious program has it. The degree of certainty that F1 is a feature of known malicious programs is low, and so is the degree of certainty that F1 is a feature of known non-malicious programs. That is, F1 is neither substantially mainly a feature of known malicious programs nor substantially mainly a feature of known non-malicious programs. The first feature screening unit 113 therefore rejects candidate feature F1 and does not treat it as an effective feature Fe.

As another example, suppose 500 known malicious programs Pm have candidate feature F2 while only 20 known non-malicious programs Pb have it. The probability that a known malicious program has F2 is then far greater than the probability that a known non-malicious program has it. The degree of certainty that F2 is a feature of known malicious programs is high; that is, F2 is substantially mainly a feature of known malicious programs. The first feature screening unit 113 therefore determines candidate feature F2 to be an effective feature Fe.

Similarly, suppose only 50 known malicious programs Pm have candidate feature F3 while a far larger number of known non-malicious programs Pb have it. The probability that a known non-malicious program has F3 is then far greater than the probability that a known malicious program has it. The degree of certainty that F3 is a feature of known non-malicious programs is high; that is, F3 is substantially mainly a feature of known non-malicious programs. The first feature screening unit 113 therefore also determines candidate feature F3 to be an effective feature Fe.

In this way, the first feature screening unit 113 selects, from the N candidate features F1 to FN, the effective features Fe that can distinguish known malicious programs from known non-malicious programs.
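The comparison behind the F1 and F2 examples can be checked with a few lines of arithmetic. The sketch below is illustrative only; the helper name `occurrence_rates` is hypothetical, and the counts are those quoted in the text (1000 known malicious programs and 1050 known non-malicious programs). F3 is left out because the text does not give its count of non-malicious programs.

```python
def occurrence_rates(malicious_with, malicious_total, benign_with, benign_total):
    """Return the fraction of each group that has a given candidate feature."""
    return malicious_with / malicious_total, benign_with / benign_total


# Counts quoted in the text for candidate features F1 and F2.
f1 = occurrence_rates(300, 1000, 320, 1050)
f2 = occurrence_rates(500, 1000, 20, 1050)

for name, (malicious_rate, benign_rate) in (("F1", f1), ("F2", f2)):
    print(f"{name}: malicious {malicious_rate:.3f}, non-malicious {benign_rate:.3f}")
```

The two rates for F1 are nearly equal, so F1 says little about which group a program belongs to, whereas for F2 the rate among malicious programs is more than twenty times the rate among non-malicious programs.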

In this embodiment, the first feature screening unit 113 decides whether each candidate feature is an effective feature according to a screening parameter corresponding to that candidate feature. The screening parameter of a candidate feature relates to the degree of certainty that the candidate feature is a feature of one of the two classes of programs, the known malicious programs or the known non-malicious programs.

For example, to decide whether a candidate feature Fi among the candidate features F1 to FN is an effective feature, the first feature screening unit 113 generates a screening parameter Pi (not shown) corresponding to Fi according to the number of known malicious programs that have Fi and the number of known non-malicious programs that have Fi. The screening parameter Pi relates to the degree of certainty that Fi is a feature of one of the two classes of programs, where i is a positive integer less than or equal to N.

If the screening parameter Pi is higher than a threshold, the degree of certainty that Fi is a feature of one of the two classes is sufficiently high, so Fi is substantially mainly a feature of that class, and the feature screening unit 113 determines candidate feature Fi to be an effective feature Fe. In this embodiment, the first feature screening unit 113 calculates the information gain of each candidate feature as the screening parameter corresponding to that candidate feature.

$$\mathrm{Gain}(S, F_i) = \mathrm{Info}(S) - \mathrm{Info}_{F_i}(S) \qquad \text{(Equation 1)}$$

Equation 1 gives the information gain of candidate feature Fi in this embodiment, where S is the set formed by all known malicious programs Pm and all known non-malicious programs Pb. Info(S) in Equation 1 is the entropy of the set S, given by Equation 2.

$$\mathrm{Info}(S) = -\sum_{j=1}^{2} p_j \log_2(p_j) \qquad \text{(Equation 2)}$$

In Equation 2, j equals 1 or 2. $p_1$ is the proportion of known malicious programs among all known malicious and known non-malicious programs, and $p_2$ is the proportion of known non-malicious programs among them. $\mathrm{Info}_{F_i}(S)$ in Equation 1 is the entropy of candidate feature Fi, given by Equation 3.

$$\mathrm{Info}_{F_i}(S) = \sum_{k=0}^{1} \frac{|S_k|}{|S|}\,\mathrm{Info}(S_k) \qquad \text{(Equation 3)}$$

In Equation 3, k equals 0 or 1. $S_0$ is the subset of S formed by the known malicious programs and known non-malicious programs that have candidate feature Fi, and $S_1$ is the subset of S formed by the known malicious programs and known non-malicious programs that do not have Fi. Accordingly, $|S_0|/|S|$ is the proportion of programs that have Fi among all known malicious and known non-malicious programs, and $|S_1|/|S|$ is the proportion of programs that do not have Fi. $\mathrm{Info}(S_k)$ in Equation 3 is the entropy of the set $S_k$, given by Equation 4.

$$\mathrm{Info}(S_k) = -\sum_{m=0}^{1} \frac{|S_{k,m}|}{|S_k|} \log_2\!\left(\frac{|S_{k,m}|}{|S_k|}\right) \qquad \text{(Equation 4)}$$

Here $\mathrm{Info}(S_0)$ is the entropy of the set of known malicious programs and known non-malicious programs that have candidate feature Fi, and $\mathrm{Info}(S_1)$ is the entropy of the set of those that do not have Fi. In Equation 4, m equals 0 or 1. For $\mathrm{Info}(S_0)$, $|S_{0,0}|/|S_0|$ is the proportion of known malicious programs among all programs that have Fi, and $|S_{0,1}|/|S_0|$ is the proportion of known non-malicious programs among all programs that have Fi. $\mathrm{Info}(S_1)$ is obtained in the same way.

For example, suppose that among 1000 known malicious programs Pm and 1050 known non-malicious programs Pb, 300 known malicious programs Pm have candidate feature F1 and 320 known non-malicious programs Pb also have F1. Then 620 of the 2050 programs have F1 and 1430 do not. Of the 1000 known malicious programs, 700 do not have F1; of the 1050 known non-malicious programs, 730 do not have F1. Hence

$$\mathrm{Info}(S) = -\left(\frac{1000}{2050}\log_2\frac{1000}{2050} + \frac{1050}{2050}\log_2\frac{1050}{2050}\right)$$

and

$$\mathrm{Info}_{F_1}(S) = -\frac{620}{2050}\left(\frac{300}{620}\log_2\frac{300}{620} + \frac{320}{620}\log_2\frac{320}{620}\right) - \frac{1430}{2050}\left(\frac{700}{1430}\log_2\frac{700}{1430} + \frac{730}{1430}\log_2\frac{730}{1430}\right)$$
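The information-gain computation of Equations 1 to 4 can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions rather than the embodiment's actual code: the function names `entropy` and `information_gain` are hypothetical, and the inputs are simply the four counts used in the examples (how many known malicious and known non-malicious programs exist, and how many of each have the candidate feature).

```python
import math


def entropy(counts):
    """Base-2 entropy of a list of class counts; empty classes contribute 0."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(mal_with, mal_total, ben_with, ben_total):
    """Gain(S, Fi) = Info(S) - Info_Fi(S) for one candidate feature."""
    total = mal_total + ben_total
    info_s = entropy([mal_total, ben_total])
    # Split S into the programs that have the feature and those that do not.
    with_feature = [mal_with, ben_with]
    without_feature = [mal_total - mal_with, ben_total - ben_with]
    info_f = sum(
        (sum(subset) / total) * entropy(subset)
        for subset in (with_feature, without_feature)
    )
    return info_s - info_f


# Counts quoted in the text for candidate features F1 and F2.
gain_f1 = information_gain(300, 1000, 320, 1050)
gain_f2 = information_gain(500, 1000, 20, 1050)
print(f"Gain(S, F1) = {gain_f1:.5f}")
print(f"Gain(S, F2) = {gain_f2:.5f}")
```

With these counts the gain for F1 comes out very close to zero, while the gain for F2 is several orders of magnitude larger, which matches the qualitative treatment of the two features in the text.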

Subtracting $\mathrm{Info}_{F_i}(S)$ from $\mathrm{Info}(S)$ as in Equation 1 yields the information gain of the candidate feature, which serves as its screening parameter Pi.

Because $\mathrm{Info}_{F_i}(S)$ is the entropy of candidate feature Fi, a larger entropy means that the data for Fi are more disordered; that is, the probability that a known malicious program has Fi is closer to the probability that a known non-malicious program has Fi. The information gain of Fi is then lower: the degree of certainty that Fi is a feature of known malicious programs is low, and so is the degree of certainty that Fi is a feature of known non-malicious programs. In this embodiment, therefore, when the information gain of candidate feature Fi is below a threshold, the first feature screening unit 113 rejects Fi and does not treat it as an effective feature Fe.

Conversely, a smaller entropy of candidate feature Fi means a larger gap between the probability that a known malicious program has Fi and the probability that a known non-malicious program has Fi, and the information gain of Fi is higher. Exactly one of two cases then holds for Fi. In the first case, the degree of certainty that Fi is a feature of known malicious programs is high; that is, Fi is substantially mainly a feature of known malicious programs. In the second case, the degree of certainty that Fi is a feature of known non-malicious programs is high; that is, Fi is substantially mainly a feature of known non-malicious programs. Therefore, when the information gain of candidate feature Fi is above the threshold, the first feature screening unit 113 determines Fi to be an effective feature Fe.
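The threshold decision described above amounts to a simple filter over the screening parameters. The sketch below is illustrative only: the function name `select_effective_features`, the rounded gain values, and the threshold of 0.05 are all assumptions, since the text does not state a numeric threshold.

```python
def select_effective_features(gains, threshold):
    """Keep the candidate features whose information gain exceeds the threshold."""
    return [name for name, gain in gains.items() if gain > threshold]


# Hypothetical screening parameters: rough information gains for F1 and F2.
gains = {"F1": 0.00002, "F2": 0.26}

# Hypothetical threshold; the text gives no numeric value.
effective = select_effective_features(gains, threshold=0.05)
print(effective)
```

Only the selected features would then be passed on to train the classification model.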

Claims

1358639 Sanda number: TW3715PA X. Patent application scope: 1. A data mining module for outputting a classification model based on a plurality of known malicious programs (Malware) and a plurality of known non-malicious programs. The data to be tested is classified into one of a malicious program and a non-malicious program according to the classification model. The data mining module includes: a program database for storing the known malicious programs and the known ones. a non-malicious program; a feature mining unit for extracting N features to be selected from the known malicious programs and the known non-malicious programs, and an i-th feature to be screened is Knowing at least one of the malware and the known non-malicious programs interacting with a file system, the known malware and the at least one of the known non-malicious programs having the i-th For the feature to be filtered, i is a positive integer less than or equal to N; a feature screening unit that filters a plurality of valid features from the N features to be selected, each of the effective features being substantially A feature to be known for one of the known malicious programs and one of the known non-malicious programs; and a classification model training unit for training the classification model based on the effective features. 2. 
The data mining module of claim 1, wherein, for the i-th feature to be screened, the feature screening unit is based on the known malicious programs having the i-th to-be-screened feature The number and the number of non-malicious programs having the i-th feature to be filtered are generated corresponding to the i-th 13 13^8639 three-number: TW3715PA to be filtered features - the i-th filter parameter is related to the i-th The _ selection feature is that the general teacher selection parameter is known to be one of the non-malware programs; = the degree 1 has the ith filter parameter higher than the threshold value, and the H system determines the ith number The feature to be screened is the s teacher selection unit. 3. In the material described in claim 2, the feature screening unit is further based on the known subgroups, and the number of non-malicious programs is generated. InfomaUcKi gain is used as a solidification: the characteristics of the selection feature. 4. The second number mentioned in item 1 of the patent application. In the middle, the classification model training unit is _ 枓枓木木木 , , 其 , , , , , , , , , , , , , , , , , , , , , , , l l l l l ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ model. . Plane (Hyper 5), as in the scope of the patent application, each of the effective features is a vector two lean mining module, and the feature screening unit performs a one-dimensional reduction operation on the effective features (Dimensi 〇n Reducti〇n), to reduce the vector dimension of each valid feature. 6. The data mining module described in claim 5, wherein 'the feature I is selected based on the principal component analysis operation ( Principle Component Analysis 'pCA) analyzes the principal components of the effective features and reduces the vector dimension of the effective features according to the results of the principal component analysis operation. 7. 
Data mining as described in item 1 of the patent scope Module, 27: TW3715PA dynamic relationship; - feature screening unit, Tian + Zhong Xi buckle ", think 4 king style and a plurality of known non-Ali and 々甘5 style features; and ^

According to these reference features, the reference-classification model is classified into a malicious program and a non-sub-sound according to the program-resolved program, and the classification module (4) is based on the effective features:; The malware described in the item is valid according to the:: the eigen is a vector, and the feature screening unit further reduces the vector dimension of each of the reference features.

14. The malicious program detection module described in claim 13, wherein the feature screening unit further reduces the vector dimension of each of the reference features according to the principal components of the effective features.
15. The malicious program detection module described in claim 12, wherein the classification unit is a support vector machine classifier (SVM classifier), and the classification model is a hyperplane.
16. The malicious program detection module described in claim 12, wherein the known malicious programs and the known non-malicious programs are stored in a program database, and the malicious program detection module further comprises a malicious program notification unit; when the program to be tested is judged to be a malicious program, the program to be tested is stored in the program database as a new known malicious program.
17. The malicious program detection module described in claim 12, wherein the feature analysis unit extracts the preliminary features with a [...] program analysis method [...].
18. The malicious program detection module described in claim 12, wherein each preliminary feature is a behavior of using a dynamic link file and an application program interface.
[...]
20. A malware detection system, comprising:
a data mining module, comprising:
a program database for storing a plurality of known malicious programs and a plurality of known non-malicious programs;
a feature mining unit for extracting N features to be screened from the known malicious programs and the known non-malicious programs, each feature to be screened being an interaction behavior between at least one of the known malicious programs and the known non-malicious programs and a file system, N being a positive integer;
a first feature screening unit for selecting a plurality of effective features from the N features to be screened, each effective feature being substantially a feature of both the known malicious programs and the known non-malicious programs; and
a classification model training unit, which is trained according to the effective features to obtain a classification model; and
[...] a support vector machine [...] classifies the program to be tested as a malicious program [...].
24. The malware detection system described in claim 20, wherein each effective feature is a vector, and the first feature screening unit performs a dimension reduction operation on the effective features to reduce the vector dimension of each effective feature.
25. The malware detection system described in claim 24, wherein the [...] feature screening unit further reduces the vector dimension of each reference feature according to the effective features.
26. The malware detection system described in claim 20, wherein the malicious program detection module further comprises a malicious program notification unit; when the program to be tested is determined to be a malicious program, the program to be tested is stored in the program database as a new known malicious program.
27. The malware detection system described in claim 20, wherein
each preliminary feature is a behavior of using a dynamic link file and an application program interface, and each feature to be screened is a behavior of at least one of the known malicious programs and the known non-malicious programs using the dynamic link files and application program interfaces of the system.
28. The malicious program [...] described in claim 2, wherein each known malicious program is one of a virus program, a worm program, a Trojan horse program, and a backdoor program.
29. A data mining method for outputting a classification model based on a plurality of known malicious programs and a plurality of known non-malicious programs, wherein a program to be tested is classified according to the classification model as one of a malicious program and a non-malicious program, the data mining method comprising:
(a) extracting N features to be screened from the known malicious programs and the known non-malicious programs, wherein the i-th feature to be screened is an interaction behavior between at least one of the known malicious programs and the known non-malicious programs and a file system, and i is a positive integer less than or equal to N;
(b) selecting a plurality of effective features from the N features to be screened, each effective feature being substantially a feature of both the known malicious programs and the known non-malicious programs; and
(c) training according to the effective features to obtain the classification model.
30. The data mining method described in claim 29, wherein step (b) comprises:
(b1) computing an i-th screening parameter corresponding to the i-th feature to be screened, according to the number of known malicious programs corresponding to the i-th feature to be screened and the number of known non-malicious programs corresponding to the i-th feature to be screened, the i-th screening parameter [...] the known malicious programs and [...]; and
(b2) determining, according to the N screening parameters, whether the N features to be screened are the effective features, wherein when the i-th screening parameter [...] a threshold value, the i-th feature to be screened is determined to be an effective feature.
31. The data mining method described in claim 30, wherein in step (b1), the i-th screening parameter is the information gain of the i-th feature to be screened.
32. The data mining method described in claim 29, wherein in step (c), a support vector machine is trained according to the effective features to obtain a hyperplane as the classification model.
33. The data mining method described in claim [...], wherein after step [...], the method further comprises: performing a dimension reduction operation on the effective features to reduce the vector dimension of each effective feature.
34. The data mining method described in claim 33, wherein each feature to be screened is a behavior of at least one of the known malicious programs and the known non-malicious programs using a dynamic link file and an application program interface.
35. A malicious program detection method for detecting whether a program to be tested is a malicious program, the malicious program detection method comprising:
(a) extracting a plurality of preliminary features from the program to be tested, each preliminary feature being an interaction behavior between the program to be tested and a file system;
(b) selecting a plurality of reference features from the preliminary features according to a plurality of effective features, each effective feature being substantially a feature of both a plurality of known malicious programs and a plurality of known non-malicious programs; and
(c) referencing a classification model and, according to the reference features, classifying the program to be tested as one of a malicious program and a non-malicious program, the classification model being trained according to the effective features.
36. The malicious program detection method described in claim 35, wherein after step (b), the method further comprises: reducing the vector dimension of each of the reference features according to the effective features.
37. The malicious program detection method described in claim [...], wherein the classification model is a hyperplane, and in step (c), according to the reference features, a support vector machine classifies the program to be tested as one of a malicious program and a non-malicious program.
[...] wherein the method further comprises: [...] the program to be tested [...] as a new known malicious program. [...] link and application program interface [...] program using the [...] system