KR101819322B1

KR101819322B1 - Malicious Code Analysis Module and Method therefor

Info

Publication number: KR101819322B1
Application number: KR1020160031262A
Authority: KR
Inventors: 이택현
Original assignee: 주식회사 엘지유플러스
Priority date: 2016-03-16
Filing date: 2016-03-16
Publication date: 2018-02-28
Also published as: KR20170107665A

Abstract

A malicious code analysis module and a malicious code analysis method thereof are disclosed. The malicious code analysis module according to the present invention includes: a dynamic analysis unit for performing a dynamic analysis on a file to be analyzed; a static analysis unit for performing a static analysis on the file to be analyzed; a database generation unit for generating a database by converting the results of the dynamic analysis unit and the static analysis unit into a binary data format; a variable selection unit for selecting, based on the generated database, variables that have a significant influence on a target variable from among the items describing characteristics of the file to be analyzed; and a harmfulness discrimination unit for discriminating the harmfulness of the file to be analyzed using a decision tree of a data mining technique based on the selected variables.

Description

Malicious Code Analysis Module and Malicious Code Analysis Method

The present invention relates to a malicious code analysis module and a malicious code analysis method thereof, and more particularly, to a malicious code analysis module and method capable of identifying unknown malicious code by analyzing characteristics of existing small-capacity executable files with a data mining technique.

In recent years, the development and diffusion of Internet technology have made a great contribution to the civilized world, which has created positive business effects, such as creating new business opportunities and utilizing resources more efficiently.

However, as the reliance on the Internet has increased, the abuse of the Internet has also increased rapidly, resulting in increased economic and psychological losses. Especially, malicious code that is newly produced or modified is used as a basic means of various application hacking and cyber security threats by bypassing the existing information protection system.

In order to suppress such abuse of Internet dependency, various researches are being conducted to identify new malicious codes which have not been known yet. However, there is little research to identify unknown malicious codes based on small-capacity executables of less than 1 Mbyte, such as EXE extensions, which account for a large portion of actual malicious codes.

Patent Publication No. 10-2012-0078016 (Open date: 2012. 07. 10)

The present invention has been made to address the situation described above, and an object of the present invention is to provide a malicious code analysis module and a method thereof capable of identifying unknown malicious code by analyzing the characteristics of known small-capacity executable files with a data mining technique.

According to an aspect of the present invention, there is provided a malicious code analysis module for analyzing malicious code, comprising: a dynamic analysis unit for performing a dynamic analysis on a file to be analyzed; a static analysis unit for performing a static analysis on the file to be analyzed; a database generation unit for generating a database by converting the results of the dynamic analysis unit and the static analysis unit into a binary data format; a variable selection unit for selecting, based on the generated database, variables that have a significant influence on a target variable from among the items describing characteristics of the file to be analyzed; and a harmfulness discrimination unit for discriminating the harmfulness of the file to be analyzed using a decision tree of a data mining technique based on the selected variables.

The malicious code analysis module may further include an inflow country determination unit for determining the country from which the file to be analyzed flowed in. In this case, the harmfulness discrimination unit discriminates the harmfulness of files flowing in from each country based on the variables selected by the variable selection unit and the inflow country determined by the inflow country determination unit.

The dynamic analysis unit can dynamically identify malicious code by analyzing the frequency of registry calls, the calling process, and the call results observed during execution of normal files and malicious files.

In addition, the static analysis unit may statically identify malicious code using at least one of meta information, application programming interface (API) analysis, and resource analysis.

The variable selection unit compares the average ratio of the normal file and the virus file based on the following equation, and determines that the variable has a significant influence on the target variable when the value of θ is larger than the set value.

.

In addition, the harmfulness discrimination unit may use at least one decision tree algorithm, such as a Classification and Regression Trees (CART) algorithm that performs binary splits using the Gini index or the reduction in variance, or a Chi-squared Automatic Interaction Detection (CHAID) algorithm that performs splits using the chi-square test or the F-test.

In this case, when the CART algorithm is used, the harmfulness discrimination unit applies the following equations, which are defined for a target variable divided into m categories, where P1, P2, ..., Pk denote the probabilities of being classified into the respective categories up to the k-th category.

.
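The equation itself is not reproduced in this text. As a minimal sketch, assuming it corresponds to the textbook Gini index, 1 - sum(P_k^2), the following Python shows how a CART-style binary split could be scored on 0/1 variables. The function names are illustrative and do not come from the patent.

```python
from collections import Counter


def gini_index(labels):
    """Textbook Gini index 1 - sum(P_k^2) over the categories in `labels`."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((n / total) ** 2 for n in counts.values())


def split_gini(left, right):
    """Weighted Gini index of a binary split; lower means purer child nodes."""
    total = len(left) + len(right)
    return (len(left) / total) * gini_index(left) + (len(right) / total) * gini_index(right)


def best_binary_feature(rows, labels):
    """Return the index of the 0/1 feature whose split has the lowest weighted Gini index."""
    scores = []
    for j in range(len(rows[0])):
        left = [y for x, y in zip(rows, labels) if x[j] == 0]
        right = [y for x, y in zip(rows, labels) if x[j] == 1]
        scores.append((split_gini(left, right), j))
    return min(scores)[1]
```

A node that mixes normal and malicious files evenly has a Gini index of 0.5, and a node that contains only one class has a Gini index of 0, so the split search favors variables that separate the two classes cleanly.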

According to an aspect of the present invention, there is provided a malicious code analysis method comprising: (a) performing dynamic analysis and static analysis on a file to be analyzed; (b) converting a result of step (a) into a binary data format to generate a database; (c) selecting, based on the database generated in step (b), variables having a significant influence on a target variable from among the items describing characteristics of the file to be analyzed; and (d) identifying the harmfulness of the file to be analyzed by using a decision tree of a data mining technique based on the selected variables.

The malicious code analysis method may further include (e) determining the inflow country of the file to be analyzed. In this case, step (d) identifies the harmfulness of files flowing in from each country based on the variables selected in step (c) and the inflow country determined in step (e).

Here, step (a) dynamically identifies malicious code by analyzing the frequency of registry calls, the calling process, and the call results observed during execution of normal files and malicious files.

Step (a) also statically identifies malicious code using at least one of meta information, API analysis, and resource analysis.

Step (c) compares the average ratios of normal files and virus files based on the following equation, and determines that a variable has a significant influence on the target variable when the value of θ is larger than a set value.

.

Step (d) may use at least one decision tree algorithm, such as a CART algorithm that performs binary splits using the Gini index or the reduction in variance, or a CHAID algorithm that performs splits using the chi-square test or the F-test.

In step (d), when the CART algorithm is used, the following equations are defined for a target variable divided into m categories, where P1, P2, ..., Pk denote the probabilities of being classified into the respective categories up to the k-th category.

.

According to the present invention, unknown malicious code can be identified by distinguishing the characteristics of existing small-capacity executable files and analyzing them with a data mining technique, which helps block the continually increasing inflow of malicious code and reduce the resulting social, economic, and psychological damage to the public.

In addition, according to the present invention, malicious code flowing into the country from abroad can be identified and handled adaptively according to its country-specific characteristics, which prevents damage that the malicious code may cause and enables a quick response to it.

FIG. 1 is a diagram schematically showing a configuration of a malicious code analysis module according to an embodiment of the present invention.
FIG. 2 is a diagram for explaining dynamic analysis of a file to be analyzed, and is an example of registry root key information.
FIG. 3 is a diagram illustrating a dynamic analysis of a file to be analyzed, showing another example of registry root key information.
FIG. 4 is a diagram illustrating an example of the registry statistical information referred to in the analysis process of the analysis object file.
FIGS. 5 and 6 are diagrams showing examples of a registry identification table that summarizes the dynamic behavior of a file based on the statistical information and the registry action information of FIG.
FIG. 7 is a diagram showing an example of access statistics information of a system file to be generated when dynamic analysis of a file to be analyzed is executed.
FIGS. 8 and 9 are views showing examples of a system file generation action identification table summarized on the basis of the known file generation action information and the statistical information of FIG.
FIG. 10 is a diagram showing an example of a meta information identification table, which is shown for explaining a static analysis of a file to be analyzed.
FIG. 11 is a diagram showing an example of an Import Table and an Export Table used to extract API information from the executable file itself.
FIG. 12 is a diagram showing an example of API ratios distributed in the analysis object file.
FIGS. 13 to 16 are diagrams showing examples of the API identification table generated by classifying the statistical information and the API function-specific characteristics of FIG.
FIG. 17 is a diagram showing an example of statistical information on resource information extracted from a file to be analyzed.
FIG. 18 is a diagram showing an example of an identification table in which a resource name, a resource language, a resource sub-language, and a resource type are classified by referring to the statistical information in FIG.
FIG. 19 is a diagram showing an example of a change in the selection variable with an increase in the value of theta (θ).
FIGS. 20 to 24 are diagrams showing an example in which 134 fields, frequency, average, standard deviation, and theta value are identified for 34,646 files to be analyzed.
FIG. 25 is a diagram showing an example of a theta (θ) independent variable.
FIG. 26 is a flowchart illustrating a malicious code analysis method according to an embodiment of the present invention.
FIGS. 27 to 30 are views showing a normal classification table according to a decision tree technique.
FIG. 31 is a diagram showing a decision tree result classification table.
FIGS. 32 and 33 are diagrams illustrating an identification rule of normal / abnormal files.

Hereinafter, a malicious code analysis module and a malicious code analysis method according to an embodiment of the present invention will be described in detail with reference to the accompanying drawings.

FIG. 1 is a diagram schematically showing a configuration of a malicious code analysis module according to an embodiment of the present invention.

Referring to FIG. 1, a malicious code analysis module 100 according to an exemplary embodiment of the present invention includes a dynamic analysis unit 110, a static analysis unit 120, an inflow country determination unit 130, a database generation unit 140, a variable selection unit 150, and a harmfulness discrimination unit 160.

The dynamic analysis unit 110 performs dynamic analysis on a file to be analyzed. Here, the dynamic analysis unit 110 can dynamically identify malicious code by analyzing the frequency of registry calls, the calling process, and the call results observed during execution of normal files and malicious files. In doing so, the dynamic analysis unit 110 can perform the dynamic analysis on the file to be analyzed based on the Windows registry.

The registry was introduced by Microsoft as a database that stores all configuration information of the Windows operating system. Windows registry information includes operating system configuration values and settings, software information, hardware information, and user preferences. All operations of the Windows operating system are performed based on the information recorded in the registry, and the registry information changes continuously as control panel settings, file associations, and system policies are changed. From this Windows registry information, malicious code can be identified by analyzing the frequency of registry calls, the calling process, and the call results observed during execution of normal files and malicious files. In the embodiment of the present invention, the registry behavior characteristics of normal files and malicious files are converted into a database and analyzed to produce an identification table that can be used to determine whether a file is normal. The process is described below.

First, consider the registry structure. The registry uses a logical directory structure consisting of top-level root keys (RootKey) and the subkeys beneath them. The hive files that store the registry information are saved under the path "%SystemRoot%\System32\Config" with the file names DEFAULT, SAM, SECURITY, SOFTWARE, and SYSTEM, and the registry information is periodically backed up. In Windows 2000 and later versions, five root keys are configured as illustrated in FIG. There are five data types of registry values used in the registry, as illustrated in FIG.
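The root key and subkey structure described above can be sketched as a small path parser. This is only an illustration: the text does not list the five root keys, so the names and abbreviations below are the standard Windows ones, and `split_registry_path` is a hypothetical helper. It tolerates stray spaces around the backslashes.

```python
# Standard Windows root key abbreviations (not listed in the text itself).
ROOT_KEY_ABBREVIATIONS = {
    "HKCR": "HKEY_CLASSES_ROOT",
    "HKCU": "HKEY_CURRENT_USER",
    "HKLM": "HKEY_LOCAL_MACHINE",
    "HKU": "HKEY_USERS",
    "HKCC": "HKEY_CURRENT_CONFIG",
}


def split_registry_path(path):
    """Split a registry path into (full root key name, list of subkeys)."""
    parts = [p.strip() for p in path.split("\\") if p.strip()]
    if not parts:
        raise ValueError("empty registry path")
    root = parts[0].upper()
    root = ROOT_KEY_ABBREVIATIONS.get(root, root)
    if root not in ROOT_KEY_ABBREVIATIONS.values():
        raise ValueError(f"unknown root key: {parts[0]}")
    return root, parts[1:]
```

For example, `split_registry_path(r"HKCU\Software\Microsoft\Windows\CurrentVersion\Run")` yields the root key `HKEY_CURRENT_USER` and the subkey list ending in `Run`.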

Next, consider the dynamic behavior of the registry. The HKEY_CURRENT_USER (HKCU) registry provides user and installed-program information such as login user information, MRU lists, drive MountPoints, RecentDocs, and TypedURLs. Analyzing this information makes it possible to identify the behavioral characteristics of a file, and it can be used as basic information for estimating the presence of malicious code. The hive file that stores the HKCU registry information is NTUSER.dat, and its storage path is "\Documents and Settings\account name" or "Users\account name\".

1) Login user information

The operating system stores the name and connection time of the last logged-in user. If an account created by malicious code is used to log in, the access can be confirmed from this information.

* Path: HKCU\Software\Microsoft\Windows\CurrentVersion\Explorer

* String: Logon User Name

2) Startup programs

Provides information on programs that start automatically at login to the operating system. Malicious code sets this registry information so that it keeps running.

* Path: HKCU\Software\Microsoft\Windows\CurrentVersion\Run

3) My Recent Documents

Provides the name and path information of the files most recently viewed by the logged-in user. In the Windows operating system, the default path is [Start Menu] - [My Recent Documents]. It provides basic information such as recently opened document files, music files, and picture files.

* Path: HKCU\Software\Microsoft\Windows\CurrentVersion\Explorer\RecentDocs

* Data type: REG_BINARY

* Data: Number starting from 0, MRUListEx

4) URLs typed in Internet Explorer

Provides the URL information typed into the Internet Explorer web browser.

* Path: HKCU\Software\Microsoft\Internet Explorer\TypedURLs

5) Drive volume information

Provides volume information and autorun-related records of drives connected to the operating system.

* Path: HKCU\Software\Microsoft\Windows\CurrentVersion\Explorer\MountPoints2

6) Files opened by default programs

Provides lists of files recently opened by the default tools of the Windows operating system, such as Paint, WordPad, and the Registry Editor. The higher the number of data values in Paint, the more recently viewed files.

* Path (Paint): HKCU\Software\Microsoft\Windows\CurrentVersion\Applets\Paint\RecentFileList

- Data type: REG_SZ

- Data: File1, File2, File3 ..

* Path (registry): HKCU\Software\Microsoft\Windows\CurrentVersion\regedit

- Data type: REG_SZ

- Data: LastKey

* Path (WordPad): HKCU\Software\Microsoft\Windows\CurrentVersion\Wordpad

* Path (SysTray): HKCU\Software\Microsoft\Windows\CurrentVersion\SysTray

* Path (Tour): HKCU\Software\Microsoft\Windows\CurrentVersion\Tour

7) Files accessed through a dialog box

Stores the list of files recently opened or saved through a dialog box, indexed alphabetically in MRUList.

* Path: HKCU\Software\Microsoft\Windows\CurrentVersion\Explorer\ComDlg32\LastVisitedMRU

* Data types: REG_BINARY (a, b, c), REG_SZ (MRUList)

* Data: a, b, c, d ...

8) Commands executed in the Run window

Stores the history of the most recently executed commands from the Windows Run window in the MRUList data area, in alphabetical order. It can be used to confirm commands executed by malicious code.

* Data format: REG_SZ

* Data: a, b, c, d ... MRUList

9) Remote Desktop Access Information

The Windows operating system provides a 'Remote Desktop Connection Service' that allows remote access to the operating system by default. The remote desktop service is used to remotely manage operating equipment, but it should be disabled when not needed because it can be exploited as an attack vector. Similar remote access tools for the Windows operating system include Radmin and VNC.

* Path: HKCU\Software\Microsoft\Terminal Server Client\Default

* Data format: REG_SZ

* Data: MRUx, where x is a number starting with 0

10) Registry Favorites

This is used when registering a favorite key position in the registry editor.

* Path: HKCU\Software\Microsoft\Windows\CurrentVersion\Applets\Regedit\Favorites

* Data format: REG_SZ

* Data: user-specified favorite name

11) Web browser connection program information

Provides information on the associated programs that can be executed by Internet Explorer, the web browser of the Windows operating system.

* Path: HKCU\Software\Microsoft\Windows\CurrentVersion\Explorer\FileExts

* Data type: REG_KEY

* Data: Extension name Key value

12) MUICache

Multilingual User Interface Cache (MUICache) is a registry key that caches program names to support multiple languages.

* Path: HKCU\Software\Microsoft\Windows\CurrentVersion\Explorer\FileExts

* Data type: REG_KEY

* Data: Extension name Key value

13) UserAssist

UserAssist provides a list of the programs executed, the number of executions, and the last execution time information.

* Path: HKCU\Software\Microsoft\Windows\CurrentVersion\Explorer\UserAssist

14) Other registry information

The following are other registry keys that may change when a file is executed on the Windows operating system.

* Path (SearchAssistant)

- HKCU\Software\Microsoft\SearchAssistant\ACMruControlPanel

* Path (ControlPanel)

- HKCU\Software\Microsoft\Windows\CurrentVersion\Explorer\ControlPanel

* Path (LAN Computer)

- HKCU\Software\Microsoft\Windows\CurrentVersion\Explorer\ComputersDescriptions

* Path (MMC Recent File List)

- HKCU\Software\Microsoft\Microsoft Management Console\Recent File List

* Path (MAP Network Drive MRU)

- HKCU\Software\Microsoft\Windows\CurrentVersion\Explorer\MAP Network Drive MRU

* Path (Media Player)

- HKCU\Software\Microsoft\MediaPlayer\Player\RecentFileList

The SOFTWARE file, which is a hive file of the HKEY_LOCAL_MACHINE registry, provides various information such as installed software information, Windows information, autorun programs, and uninstall information.

15) App Path

This is the registry information that defines the execution path for a program. It performs a function similar to the PATH system environment variable of the Windows operating system. Malicious code can manipulate this registry information to induce execution of an arbitrary program.

* Path: HKLM\SOFTWARE\Microsoft\Windows\CurrentVersion\App Paths

* Data: Program name. Extension

16) Shell Open Command

The Shell Open Command registry entry determines how executable files are launched. Malicious code can manipulate it to induce execution of an arbitrary program.

* Path: HKLM\SOFTWARE\Classes\exefile\shell\open\command

* Data format: REG_SZ

* Data: "%1" %*

17) Autorun

Lists the programs that run automatically at login to the Windows operating system. The same settings can be viewed with the msconfig command provided in Windows. Malicious code can modify the path so that an arbitrary file is executed at startup.

* Path: HKLM\SOFTWARE\Microsoft\Windows\CurrentVersion\Run

* Data format: REG_SZ

* Data: Execution path

18) WinLogon

Stores the logon setting information of the Windows operating system. Malicious code can use it to execute automatically at Windows logon.

* Path: HKLM\SOFTWARE\Microsoft\Windows\CurrentVersion\Winlogon

* Data format: REG_SZ

* Data: Execution path

19) BHO

BHO (Browser Helper Objects) is a feature provided to extend the functions of Internet Explorer. When malicious code manipulates this value, it can control web browser functions or expose key information.

* Path: HKLM\SOFTWARE\Microsoft\Windows\CurrentVersion\Explorer\Browser Helper Objects

* Data format: REG_SZ

* Data: Execution path

20) WinNT_CV

Provides version information for the Windows operating system, registered user name, registered group name, service pack version, product name, license, and installation period.

* Path: HKLM\SOFTWARE\Microsoft\Windows\CurrentVersion\Winlogon

21) Other Information

* Path (trash information): HKLM\SOFTWARE\Microsoft\Windows\CurrentVersion\Explorer\BitBucket

* Path (network device name): HKLM\SOFTWARE\Microsoft\Windows NT\CurrentVersion\NetworkCards

* Path (AppInit_DLLs): HKLM\SOFTWARE\Microsoft\Windows\CurrentVersion\Windows

* Path (ImageFileExecution): HKLM\SOFTWARE\Microsoft\Windows\CurrentVersion\Image File Execution Options

* Path (Uninstall): HKLM\SOFTWARE\Microsoft\Windows\CurrentVersion\Uninstall

* Path (ProfileList): HKLM\SOFTWARE\Microsoft\Windows\CurrentVersion\ProfileList

The SYSTEM file, which is a hive file of the HKEY_LOCAL_MACHINE registry, provides various items such as installed software information, Windows information, autorun programs, and uninstall information.

22) Shutdown

Provides information on the time and number of times the Windows operating system has been shut down.

* Path (ShutdownTime): HKLM\System\ControlSet001\Control\Windows

* Path (ShutdownCount): HKLM\System\ControlSet001\Control\Watchdog\Display

23) TimeZoneInformation

Provides Windows operating system time information.

* Path: HKLM\System\ControlSet001\Control\Watchdog\Display

24) fDenyTSConnections

Indicates whether the remote desktop function of the Windows operating system is enabled. Malicious code can enable the remote desktop function for remote access.

* Path: HKLM\System\ControlSet001\Control\Terminal Server

25) MountedDevices

Displays the Drive Signature, Device Volume, and registry path device information associated with the operating system.

* Path: HKLM\System\ControlSet001\Enum\MountedDevices

26) Network Key

Provides the network information (interface name) and connection information (IP address, subnet mask, gateway) of the operating system.

* Path: HKLM\System\ControlSet001\Control\Network

HKLM\System\ControlSet001\Services\Tcpip\Parameters\Interfaces

27) Firewall settings

Provides firewall configuration information (firewall activation, globally open ports, allowed programs and port information, etc.) of the Windows operating system. Malicious code can change the firewall settings to prevent the firewall from blocking its malicious activity.

* Path: HKLM\System\ControlSet001\Services\SharedAccess\Parameters\FirewallPolicy\StandardProfile

28) USB storage device

Provides information on USB storage devices (USB device name, last connection time, serial number, etc.) connected to the Windows operating system.

* Path: HKLM\System\ControlSet001\Enum\USBStor

29) DeviceClasses

Provides device information (time information, device code), volume information (time information, ParentPrefixID), and the like for storage devices connected to the Windows operating system.

* Path: HKLM\System\ControlSet001\DeviceClasses

30) IDE

Provides connection information for IDE devices connected to the Windows operating system.

* Path: HKLM\System\ControlSet001\Enum\IDE

31) Services

Provides driver information and service information (service name and path, service registration time, etc.) used in the Windows operating system.

* Path: HKLM\System\ControlSet001\Services

The SECURITY file that stores HKEY_LOCAL_MACHINE registry information provides computer audit security policy information, such as system events and login events, for the Windows operating system. The audit policy registry value consists of a 64-byte string divided into 8 policies. If the data is set to 1, successes are audited; if it is set to 2, failures are audited; and if it is set to 3, both successes and failures are audited.
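The audit values 1, 2, and 3 described above behave like two bit flags (1 for success, 2 for failure, 3 for both). A minimal sketch of decoding a single policy value under that reading follows. Treating 0 as "no auditing" is an assumption, and the layout of the 64-byte string is not modeled here.

```python
def decode_audit_value(value):
    """Decode one audit policy value, assuming bit 0 = audit success, bit 1 = audit failure."""
    if value not in (0, 1, 2, 3):
        raise ValueError(f"unexpected audit value: {value}")
    return {"success": bool(value & 1), "failure": bool(value & 2)}
```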

Next, an example of an identification table of the registry dynamic behavior will be described.

Malicious code typically needs to register itself for automatic execution, add or delete Windows firewall entries, enable the remote access terminal, register services, and so on. Observing these changes in the registry therefore provides basic data for classifying the characteristics of normal files and malicious files. FIG. 4 shows an example of the registry statistical information referenced during execution of the file to be analyzed. FIGS. 5 and 6 are diagrams showing examples of a registry identification table that summarizes the dynamic behavior of a file based on the statistical information and the registry action information of FIG. In the embodiment of the present invention, the identification tables shown in FIG. 5 and FIG. 6 are used as basic data for classifying normal files and malicious files.
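As a rough illustration of how observed registry activity could be turned into identification-table entries, the sketch below counts successful write operations per watched key and derives 0/1 flags. The event tuple format, the operation names, and the choice of watched keys (taken from the paths listed above) are assumptions, not the patent's actual table.

```python
from collections import Counter

# Hypothetical watch list: category -> registry key prefix from the lists above.
WATCHED_KEYS = {
    "autorun": r"HKLM\SOFTWARE\Microsoft\Windows\CurrentVersion\Run",
    "firewall": r"HKLM\System\ControlSet001\Services\SharedAccess\Parameters\FirewallPolicy",
    "remote_desktop": r"HKLM\System\ControlSet001\Control\Terminal Server",
    "services": r"HKLM\System\ControlSet001\Services",
}
WRITE_OPERATIONS = {"SetValue", "CreateKey", "DeleteValue"}  # assumed log vocabulary


def categorize(key_path):
    """Return the watched category matching key_path (longest prefix wins), or None."""
    key = key_path.lower()
    matches = [(len(prefix), name) for name, prefix in WATCHED_KEYS.items()
               if key.startswith(prefix.lower())]
    return max(matches)[1] if matches else None


def registry_features(events):
    """events: iterable of (operation, key_path, succeeded) tuples from a sandbox log.

    Returns (call frequency per category, 0/1 flag per category); only
    successful write operations are counted.
    """
    counts = Counter()
    for operation, key_path, succeeded in events:
        category = categorize(key_path)
        if category and succeeded and operation in WRITE_OPERATIONS:
            counts[category] += 1
    binary = {name: int(counts[name] > 0) for name in WATCHED_KEYS}
    return dict(counts), binary
```

The longest-prefix rule matters because the firewall key sits underneath the Services key; without it, a firewall change would be counted as a generic service change. The prefix match is deliberately loose, so a key such as RunOnce is also counted under the autorun category.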

When malicious code is executed, it downloads additional malicious programs or hides itself in the system folder or the trash folder. It may also link a file to the startup program path or set up a scheduled task so that it is executed periodically and keeps running.

FIG. 7 is a diagram showing an example of system file access statistical information generated when dynamic analysis of a file to be analyzed is executed. In this case, the standard deviation is 27.71.

FIGS. 8 and 9 are views showing examples of a system file generation action identification table summarized on the basis of the known file generation action information and the statistical information of FIG. In the embodiment of the present invention, the identification information shown in Figs. 8 and 9 is used as basic information for distinguishing the characteristics of the normal file and the malicious file.
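The file creation behaviors described above (hiding in the system or trash folder, linking into the startup path, scheduling) can be sketched as 0/1 flags derived from the paths a sample creates. The folder markers below are illustrative assumptions about typical Windows locations, not the contents of the identification tables in the figures.

```python
import ntpath

# Hypothetical path markers for each behaviour category.
LOCATION_MARKERS = {
    "system_folder": ("\\windows\\system32\\", "\\windows\\syswow64\\"),
    "recycle_bin": ("\\$recycle.bin\\", "\\recycler\\"),
    "startup_folder": ("\\start menu\\programs\\startup\\",),
    "scheduled_task": ("\\windows\\tasks\\", "\\windows\\system32\\tasks\\"),
}


def file_creation_features(created_paths):
    """Return a 0/1 flag per category for the files a sample created."""
    flags = {name: 0 for name in LOCATION_MARKERS}
    for path in created_paths:
        # ntpath handles Windows-style paths regardless of the host OS.
        normalized = ntpath.normpath(path).lower()
        for name, markers in LOCATION_MARKERS.items():
            if any(marker in normalized for marker in markers):
                flags[name] = 1
    return flags
```

A single path may set more than one flag; for example, a file under `\Windows\System32\Tasks\` counts as both a system folder write and a scheduled task.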

As described above, the dynamic analysis unit 110 of the embodiment of the present invention dynamically analyzes malicious code with reference to registry change information, system-generated files, and the like, and produces a dynamic analysis identification table from the data and extracted information of existing analysis target files.

The static analysis unit 120 performs a static analysis on a file to be analyzed. Here, the static analysis unit 120 may statically identify malicious code using at least one of meta information, API analysis, and resource analysis.

In general, software records the program name, version information, product name, and the like as meta information during the production process. Such attribute information can be used as basic information for identifying the purpose and use of the software, and the meta information may differ depending on the developer's development environment. FIG. 10 is a diagram showing an example of an identification table defined by examining the proportion of analysis target files that contain each item of meta information in the embodiment of the present invention. Here, the top 10 meta information items of the files account for the majority at a ratio of 91.95%, and they are evenly distributed, with a standard deviation of 0.03%.

An API (Application Programming Interface) is a predetermined method provided by an operating system or programming language so that an application program can use system resources. The API provides interfaces for file control, window control, image processing, character control, and so on, and is called from inside the program to use system resources or to interact with other applications [28]. In the Windows operating system, the Windows API functions necessary for application program operation are provided either by linking a dynamic library (Dynamic Link Library) file or by statically including them in the file itself. The Windows API operates in user mode and kernel mode, and the API operating in kernel mode is called the Native API. The behavioral characteristics of a file can be identified by extracting and comparing its API information. API information can be extracted either by analyzing the binary file itself or by hooking the APIs that are called while the program is running. The binary itself can be analyzed through the IAT of a portable executable (PE) file executable in the Windows operating system, and the hooking methods include SSDT hooking and IDT (Interrupt Descriptor Table) hooking. SSDT hooking operates on a function table used in the kernel, and IDT hooking is a method of extracting the API by changing the interrupt processing path of the IDT holding the API information.

In the embodiment of the present invention, the API information included in the IAT of the binary is extracted using pefile, a PE file analysis module for Python. FIG. 11 is an example of an Import Table and an Export Table used to extract API information from the executable file itself, and FIG. 12 is an example of the API ratios distributed in the analysis target files. In the Import Table, APIs such as GetProcAddress and ExitProcess appear at a high rate.
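A minimal sketch of this extraction step with the third-party pefile package might look as follows. The function names `extract_imported_apis` and `api_ratios` are illustrative, and the ratio calculation is only one plausible reading of the "API ratio" shown in the figure.

```python
from collections import Counter


def extract_imported_apis(path):
    """Return the API names listed in the import table of the PE file at `path`."""
    import pefile  # third-party package: pip install pefile

    pe = pefile.PE(path)
    names = []
    # DIRECTORY_ENTRY_IMPORT is absent when the file has no import table.
    for entry in getattr(pe, "DIRECTORY_ENTRY_IMPORT", []):
        for imported in entry.imports:
            if imported.name is not None:  # imports by ordinal have no name
                names.append(imported.name.decode("ascii", errors="replace"))
    pe.close()
    return names


def api_ratios(api_lists):
    """Fraction of files that import each API, given one list of API names per file."""
    file_counts = Counter()
    for names in api_lists:
        file_counts.update(set(names))  # count each API once per file
    total = len(api_lists)
    return {name: count / total for name, count in file_counts.items()}
```

For two files where both import GetProcAddress and only one imports ExitProcess, `api_ratios` reports 1.0 and 0.5 respectively.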

FIGS. 13 to 16 are diagrams showing examples of the API identification table generated by classifying the statistical information and the API function-specific characteristics of FIG. The groups of API functions are classified into file, registry, process, command execution, and network communication.
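The five API groups named above could be encoded as 0/1 variables along the following lines. The specific API names assigned to each group are a small illustrative selection of real Windows functions, not the contents of the identification tables in the figures.

```python
# Illustrative (not exhaustive) assignment of Windows API names to the five groups.
API_GROUPS = {
    "file": {"CreateFileA", "CreateFileW", "WriteFile", "DeleteFileA", "DeleteFileW"},
    "registry": {"RegOpenKeyExA", "RegSetValueExA", "RegCreateKeyExA", "RegDeleteKeyA"},
    "process": {"CreateProcessA", "CreateProcessW", "OpenProcess", "TerminateProcess"},
    "command_execution": {"WinExec", "ShellExecuteA", "ShellExecuteW"},
    "network": {"connect", "send", "recv", "InternetOpenA", "URLDownloadToFileA"},
}


def api_group_features(imported_apis):
    """Return 1 for each group with at least one imported API, otherwise 0."""
    imported = set(imported_apis)
    return {group: int(bool(imported & names)) for group, names in API_GROUPS.items()}
```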

In the structure of a PE (Portable Executable) file, IMAGE_NT_HEADER contains information such as the number of sections, time information, and execution attributes. The SECTION TABLE holds code, data, resources, and debug information. A resource includes information such as a resource name, a resource language, and a resource type. Such information can be used as basic information for classifying the characteristics of a file. FIG. 17 is a diagram showing an example of statistical information on resource information extracted from a file to be analyzed, and FIG. 18 is an example of an identification table in which a resource name, a resource language, a resource sub-language, and a resource type are classified.
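To make the header fields concrete, the sketch below reads the number of sections, the timestamp, and two execution attributes directly from the raw bytes of a PE file using the documented COFF file header layout. The function name and the returned dictionary keys are illustrative.

```python
import struct
from datetime import datetime, timezone


def read_pe_file_header(data):
    """Read NumberOfSections, TimeDateStamp and attributes from raw PE file bytes."""
    if data[:2] != b"MZ":
        raise ValueError("missing DOS 'MZ' signature")
    # e_lfanew at offset 0x3C holds the file offset of the PE signature.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\0\0":
        raise ValueError("missing PE signature")
    # The 20-byte IMAGE_FILE_HEADER follows the 4-byte signature.
    (_machine, n_sections, timestamp, _symtab, _nsyms,
     _opt_size, characteristics) = struct.unpack_from("<HHIIIHH", data, pe_offset + 4)
    return {
        "number_of_sections": n_sections,
        "timestamp": datetime.fromtimestamp(timestamp, tz=timezone.utc),
        "is_executable": bool(characteristics & 0x0002),  # IMAGE_FILE_EXECUTABLE_IMAGE
        "is_dll": bool(characteristics & 0x2000),  # IMAGE_FILE_DLL
    }
```

In practice the bytes would come from `open(path, "rb").read()`; a library such as pefile exposes the same fields along with the section table and resources.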

In the embodiment of the present invention, a static analysis identification table is produced based on the static information of the file to be analyzed.

The inflow country determination unit 130 determines the inflow country of the file to be analyzed. In general, when a file flows in from another country, the country from which it was sent can be determined based on network information. The inflow country determination unit 130 thus determines from which country the file to be analyzed flowed in.
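The text does not say which network information is used. One simple possibility, shown here purely as an assumption, is to map the source IP address of the transfer to a country through a network-to-country table. The table below uses documentation-only address ranges (RFC 5737) with arbitrary country codes; a real deployment would rely on a GeoIP database.

```python
import ipaddress

# Hypothetical lookup table; the ranges are reserved for documentation and the
# country codes are placeholders.
NETWORK_TO_COUNTRY = {
    ipaddress.ip_network("192.0.2.0/24"): "KR",
    ipaddress.ip_network("198.51.100.0/24"): "US",
    ipaddress.ip_network("203.0.113.0/24"): "JP",
}


def inflow_country(source_ip, default="UNKNOWN"):
    """Return the country code whose network contains source_ip, or `default`."""
    address = ipaddress.ip_address(source_ip)
    for network, country in NETWORK_TO_COUNTRY.items():
        if address in network:
            return country
    return default
```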

The database generation unit 140 converts the results of the dynamic analysis unit 110, the static analysis unit 120, and the incoming country determination unit 130 into a binary data format to generate a database. That is, the database generation unit 140 stores the identification table generated through the dynamic analysis unit 110 and the static analysis unit 120 as a database, and stores the identification table, which is determined by the influent country determination unit 130, Into an ID code corresponding thereto and stores the database.
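The conversion into binary data and country ID codes can be illustrated with a short sketch. Everything here is hypothetical: the function names, the column order, and the rule of assigning ID codes in order of first appearance are assumptions, since the source does not describe the storage layout.

```python
def to_binary_row(observed_items, item_order):
    """Encode the identification items observed for one file as 0/1 values."""
    observed = set(observed_items)
    return [1 if item in observed else 0 for item in item_order]


def country_id(country, code_table):
    """Return the ID code for a country, assigning a new one if needed.

    `code_table` is a dict that is updated in place; codes start at 1 and
    follow the order of first appearance (an assumption of this sketch).
    """
    if country not in code_table:
        code_table[country] = len(code_table) + 1
    return code_table[country]
```

A database row for one file would then be the binary list from `to_binary_row` plus the integer returned by `country_id`.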

The variable selection unit 150 selects, from among the items describing the characteristics of the analysis target file, a variable that can have a significant influence on the target variable, based on the database generated by the database generation unit 140. To do so, the variable selection unit 150 compares the average ratios of the normal files and the virus files on the basis of Equation (1), and when the value of theta (θ) is larger than a set value, it determines that the variable has a significant influence on the target variable and selects it.

[Equation 1]

.

In this way, independent variables that have a significant effect on the target variable can be checked and selected before the decision tree is run. In the embodiment of the present invention, three values are selected based on how the number of independent variables changes as the theta (θ) value increases. That is, 119 variables (88.81%) were independent variables when the theta (θ) value was 0, 63 variables (47.01%) when the value was 1.3, and 12 variables (8.96%) at the largest of the three values. FIG. 19 shows an example of the change in the selected variables as the theta (θ) value increases.

FIGS. 20 to 24 are diagrams showing an example of the 134 fields identified for the 34,646 files to be analyzed, together with their frequency, average, standard deviation, and theta (θ) value. Here, a frequency item of 1 means the item is present and 0 means it is absent. To select the independent variables supporting the target variable, variables were selected when the theta (θ) value was equal to or more than 1.3, or equal to or more than 2.5. At the applied theta (θ) threshold, a total of 72 variables were selected from static analysis items such as metadata (basic information), packer inspection, virtual machine detection techniques, API analysis (anti-debugging), and resource analysis, and a total of 47 variables were selected from items including system file creation, registry execution, and program execution. The values of all the independent variables are binary data of 0 and 1, and details are shown in the corresponding figure.
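The threshold-based selection can be sketched as a simple filter. The sketch takes each variable's theta (θ) value as already computed, because Equation (1) is not reproduced here; `select_variables` is a hypothetical name, and the strict comparison follows the wording "larger than the set value".

```python
def select_variables(theta_by_variable, threshold):
    """Return the names of variables whose theta exceeds the threshold.

    `theta_by_variable` maps a variable name to its precomputed theta value.
    """
    return sorted(
        name for name, theta in theta_by_variable.items() if theta > threshold
    )


def selection_counts(theta_by_variable, thresholds):
    """Count the selected variables for each candidate threshold."""
    return {t: len(select_variables(theta_by_variable, t)) for t in thresholds}
```

Running `selection_counts` over a grid of thresholds gives the kind of curve described for FIG. 19, from which a few representative thresholds can be picked.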

The hazard identification unit 160 identifies the harmfulness of the analysis target file using a decision tree of a data mining technique, based on the variables selected by the variable selection unit 150. Here, the hazard identification unit 160 may use at least one decision tree algorithm among the CART algorithm, which performs binary separation using the Gini index or a reduction in variance, and the CHAID algorithm, which performs a chi-square test or an F-test.
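As an illustration of the chi-square test that CHAID relies on, the sketch below computes the Pearson chi-square statistic for a 2x2 table of one binary independent variable against the binary target (normal or virus). It is a simplified sketch: the function name is hypothetical, it assumes no row or column total is zero, and it omits the p-value and the category-merging steps of a full CHAID implementation.

```python
def chi_square_2x2(table):
    """Pearson chi-square statistic for a 2x2 contingency table.

    `table[i][j]` is the number of files with feature value i and class j.
    Assumes every row total and column total is non-zero.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    statistic = 0.0
    for i, row_total in enumerate(row_totals):
        for j, col_total in enumerate(col_totals):
            expected = row_total * col_total / n
            statistic += (table[i][j] - expected) ** 2 / expected
    return statistic
```

A larger statistic indicates a stronger association between the variable and the target, so a CHAID-style procedure would prefer that variable for the split.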

When the CART algorithm is used, the hazard identification unit 160 divides the target variable into m categories and denotes the probability of classification into the k-th category by Pk (P1, P2, ...); the Gini index is then defined as in Equation (2).

[Equation 2]

.
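For reference, the Gini index as it is commonly defined for CART, consistent with the description of m categories and probabilities P_k, can be written as below. The symbol G and the index range k = 1, ..., m are notation assumed here rather than taken from the equation itself.

```latex
G = 1 - \sum_{k=1}^{m} P_k^{2}
```

With two categories and P_1 = P_2 = 0.5 this gives G = 0.5, the maximum for m = 2, while a node containing a single category gives G = 0.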

CART forms child nodes using the independent variable, and the optimal split of that variable, that most reduces the Gini index. Equation (3) gives the amount of decrease in the Gini index used in this process.

[Equation 3]

In other words, the child nodes are formed by the independent variable and the optimal split that maximally reduce impurity when the node is separated. This is equivalent to minimizing the weighted sum of the impurity in the child nodes, as shown in Equation (4).

[Equation 4]
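The split criterion described above can be sketched for binary (0/1) independent variables as follows. This is a minimal illustration of the standard CART computation, not the embodiment's code; the function names are hypothetical. The reduction is the parent's Gini index minus the size-weighted Gini index of the two child nodes, so maximizing the reduction and minimizing the weighted child impurity select the same variable.

```python
from collections import Counter


def gini(labels):
    """Gini index of a node: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_scores(feature, labels):
    """Return (gini_reduction, weighted_child_impurity) for a 0/1 feature."""
    left = [y for x, y in zip(feature, labels) if x == 0]
    right = [y for x, y in zip(feature, labels) if x == 1]
    n = len(labels)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(labels) - weighted, weighted


def best_split(columns, labels):
    """Name of the column whose split reduces the Gini index the most."""
    return max(columns, key=lambda name: split_scores(columns[name], labels)[0])
```

Because every independent variable in the database is binary, each variable offers only one possible split, which is why the sketch does not search over split points.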

In the examples of the present invention, the presence or absence of a virus was examined using 43 publicly available antivirus (vaccine) programs. A known virus can be identified easily, but a relatively safe file may also be identified as a virus because the file inspection standards differ between vaccine programs. Therefore, additional study is needed to establish an appropriate criterion for whether a file flagged by a single vaccine should be identified as a virus. In the examples of the present invention, experiments were conducted separately for the cases where the number of vaccines diagnosing a virus was 1 or more, 5 or more, 20 or more, and 40 or more. These criteria can be changed according to the strength of the security policy on virus identification in an enterprise or organization.
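The labeling criteria can be expressed as a threshold on the number of vaccines that flag a file. The sketch below uses the four thresholds named in the text; the function names and the 1/0 encoding of virus and normal are assumptions.

```python
DETECTION_THRESHOLDS = (1, 5, 20, 40)  # "N or more" vaccine detections


def label_by_detections(detections, threshold):
    """Return 1 (virus) if at least `threshold` vaccines flag the file."""
    return 1 if detections >= threshold else 0


def labels_for_all_thresholds(detections):
    """Label one file under each of the four experimental criteria."""
    return {t: label_by_detections(detections, t) for t in DETECTION_THRESHOLDS}
```

A file flagged by 7 of the 43 vaccines, for instance, counts as a virus under the first two criteria and as a normal file under the last two.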

FIG. 26 is a flowchart illustrating a malicious code analysis method according to an embodiment of the present invention. The malicious code analysis method according to the embodiment of the present invention can be performed by the malicious code analysis module 100 described above.

Referring to FIGS. 1 to 26, the malicious code analysis module 100 performs a dynamic analysis and a static analysis on a file to be analyzed (S110). The malicious code analysis module 100 can dynamically identify malicious code by analyzing the registry frequency, the calling process, and the calling result observed during execution of normal files and malicious files. The malicious code analysis module 100 can also statically identify malicious code using at least one of meta information, API analysis, and resource analysis.

The malicious code analysis module 100 determines the entry country of the analysis target file (S120). In general, when a file is brought in from another country, the country from which it entered can be determined based on network information. The malicious code analysis module 100 thus determines from which country the file to be analyzed was brought in.

The malicious code analysis module 100 generates a database by converting the results of the dynamic analysis and the static analysis of the analysis target file, and the entry country of the analysis target file, into a binary data format (S130). That is, the malicious code analysis module 100 stores the identification tables generated by the dynamic analysis and the static analysis of the analysis target file as a database, and converts the entry country determined for the analysis target file into a corresponding identification code.

In step S140, the malicious code analysis module 100 selects, from among the items describing the characteristics of the analysis target file, a variable that can have a significant influence on the target variable, based on the generated database. To do so, the malicious code analysis module 100 compares the average ratios of the normal files and the virus files based on Equation (1), and when the value of theta (θ) is larger than a set value, it determines that the variable has a significant influence on the target variable and selects it.

In this way, independent variables that have a significant effect on the target variable can be checked and selected before the decision tree is run. In the embodiment of the present invention, three values are selected based on how the number of independent variables changes as the theta (θ) value increases. That is, 119 variables (88.81%) were independent variables when the theta (θ) value was 0, 63 variables (47.01%) when the value was 1.3, and 12 variables (8.96%) at the largest of the three values.

The malicious code analysis module 100 identifies the harmfulness of the analysis target file using a decision tree of a data mining technique based on the selected variables (S150). Here, the malicious code analysis module 100 can use at least one decision tree algorithm among the CART algorithm, which performs binary separation using the Gini index or a reduction in variance, and the CHAID algorithm, which performs a chi-square test or an F-test. When the CART algorithm is used, the malicious code analysis module 100 divides the target variable into m categories and denotes the probability of classification into the k-th category by Pk (P1, P2, ...).

In the embodiment of the present invention, a decision tree analysis was performed on the 34,646 items (100%) of analysis data described above in order to develop a harmfulness prediction model for files suspected of containing malicious code. Of the total analysis data, 24,252 items (70%) were used as learning data and 10,394 items (30%) as verification data. In addition, the number of vaccine detections used to identify a virus was divided into four categories (1 or more, 5 or more, 20 or more, 40 or more), and the analysis was carried out for each theta (θ) value. As the R-Studio settings for running the decision tree, the mincriterion value of the ctree_control function was set to 0.99, the minsplit value to 1, and the maxdepth value to 10.
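The 70/30 division can be sketched as a shuffled index split. The source does not say how the files were assigned to the two sets, so the random shuffle, the fixed seed, and the function name are assumptions; only the sizes are taken from the text.

```python
import random


def split_70_30(n_items, train_fraction=0.7, seed=0):
    """Shuffle item indices and split them into learning and verification sets."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_train = round(n_items * train_fraction)
    return indices[:n_train], indices[n_train:]


# With the 34,646 analysis items this yields 24,252 and 10,394 indices.
learning, verification = split_70_30(34646)
```

The tree itself would then be fitted on the learning indices only, with the verification indices held back for the accuracy check.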

As a result of the analysis, the classification criterion that treats a file detected by five or more vaccines as malicious code and a file detected by fewer than five as a normal file, with a theta (θ) value of 0 and 119 independent variables, showed a low standard deviation. It was also confirmed that accuracy decreases as the theta (θ) value increases and the number of independent variables decreases. Details are shown in FIGS. 27 to 30.

Referring to FIG. 31, the verification results when the malicious code identification criterion is 5 or more detections and the theta (θ) value is 0 are described in detail as follows. For the learning data (70% of the total analysis data), the classification accuracy was 84.56% and the misclassification rate was 15.44%. The specificity, that is, the rate at which a normal file is predicted as a normal file, was 81.03%, and the sensitivity, that is, the rate at which an actual virus is predicted as a virus, was 87.05%. When the verification data (30% of the total analysis data) was analyzed with the classification model derived from the learning data, the classification accuracy was 84.42% and the misclassification rate was 15.58%.

For the verification data, the specificity of predicting a normal file as a normal file was 80.39%, and the sensitivity of predicting an actual virus as a virus was 87.20%, which was not significantly different from the learning data. FIG. 31 is a decision tree classification table used to identify the harmful characteristics of a file suspected of being malicious. In the actual classification process the maximum tree depth was set to 10, but the figure uses a maximum depth of 4 because a deeper tree is difficult to read in a figure. The top-level separation criterion of the prediction model was the static analysis item manufacturer name (ME06), followed by dynamic analysis items such as registry office access (R22) and multimedia access (RC25).
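The accuracy, misclassification rate, sensitivity, and specificity quoted above all follow from a confusion matrix. The sketch below shows the standard definitions, treating "virus" as the positive class; the function name is hypothetical and the counts in the usage note are made up for illustration, not taken from the experiment.

```python
def classification_metrics(tp, fn, tn, fp):
    """Compute standard metrics from confusion-matrix counts.

    tp: viruses predicted as virus      fn: viruses predicted as normal
    tn: normal files predicted normal   fp: normal files predicted as virus
    """
    total = tp + fn + tn + fp
    return {
        "accuracy": (tp + tn) / total,
        "misclassification_rate": (fp + fn) / total,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }
```

With hypothetical counts of tp=87, fn=13, tn=81, and fp=19, the function gives an accuracy of 0.84, a sensitivity of 0.87, and a specificity of 0.81.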

The rules in FIG. 31 for distinguishing a normal file from a virus are arranged around the leaf nodes and shown in FIGS. 32 and 33. A file that satisfies a rule is classified as a normal file or a virus accordingly, and the identification criterion items vary according to the separation stopping rule.
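Leaf-node rules of this kind can be represented as lists of conditions on the binary variables. In the sketch below the variable codes ME06, R22, and RC25 are taken from the text, but the rule contents, the resulting labels, and the default are entirely hypothetical; the actual rules are those of FIGS. 32 and 33.

```python
# Hypothetical rules for illustration only: each rule is a set of required
# variable values together with the label assigned when all of them match.
EXAMPLE_RULES = [
    ({"ME06": 1}, "normal"),
    ({"ME06": 0, "R22": 1}, "virus"),
    ({"ME06": 0, "R22": 0, "RC25": 1}, "virus"),
]


def classify(record, rules=EXAMPLE_RULES, default="normal"):
    """Return the label of the first rule whose conditions all hold.

    `record` maps variable codes to 0/1; missing variables are treated as 0.
    """
    for conditions, label in rules:
        if all(record.get(var, 0) == value for var, value in conditions.items()):
            return label
    return default
```

Each rule corresponds to one root-to-leaf path of the tree, so a deeper tree simply yields rules with more conditions.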

In the embodiment of the present invention, a decision tree model capable of distinguishing malicious code from normal files was built. Among the decision models generated by classifying normal and malicious files based on the number of vaccine detections, the standard deviation (3.02) was found to be low, and the accuracy (84.56%), sensitivity (87.05%), and specificity (81.03%) were found to be high.

Claims

A malicious code analysis module system for malicious code analysis, comprising:
A dynamic analysis unit for performing a dynamic analysis on a file to be analyzed;
A static analysis unit for performing a static analysis on the file to be analyzed;
A database generation unit for generating a database by converting the results of the dynamic analysis unit and the static analysis unit into a binary data format;
A variable selection unit for selecting, based on the generated database, a variable that has a significant influence on a target variable from among items describing characteristics of the file to be analyzed; and
A harmfulness identification unit for identifying the harmfulness of the file to be analyzed using a decision tree of a data mining technique based on the selected variable,
Wherein the harmfulness identification unit uses at least one decision tree algorithm among a CART (Classification and Regression Trees) algorithm that performs binary separation using a Gini index or a reduction in variance, and a CHAID (Chi-squared Automatic Interaction Detection) algorithm that performs a chi-square test or an F-test.

The malicious code analysis module system according to claim 1,
Further comprising an incoming country determination unit for determining an entry country of the file to be analyzed,
Wherein the harmfulness identification unit identifies, for each country, the harmfulness of the incoming file to be analyzed based on the variable selected by the variable selection unit and the entry country determined by the incoming country determination unit.

The malicious code analysis module system according to claim 1,
Wherein the dynamic analysis unit dynamically identifies malicious code by analyzing the registry frequency, the calling process, and the calling result observed during execution of a normal file and a malicious file.

The malicious code analysis module system according to claim 1,
Wherein the static analysis unit statically identifies malicious code using at least one of meta information, application programming interface (API) analysis, and resource analysis.

The malicious code analysis module system according to claim 1,
Wherein the variable selection unit compares the average ratios of the normal file and the virus file on the basis of the following expression and, when the value of theta (θ) is larger than a set value, determines that the variable has a significant influence on the target variable and selects it:

.

delete

The malicious code analysis module system according to claim 1,
Wherein, when the CART algorithm is used and the target variable is divided into m categories with the probability of classification into the k-th category denoted P1, P2, ..., Pk, the harmfulness identification unit uses the formula defined as follows:

.

A malicious code analysis method performed by a malicious code analysis module system, the method comprising:
(a) performing a dynamic analysis and a static analysis on a file to be analyzed;
(b) generating a database by converting the result of step (a) into a binary data format;
(c) selecting, based on the database generated in step (b), a variable having a significant influence on a target variable from among items describing characteristics of the file to be analyzed; and
(d) identifying the harmfulness of the file to be analyzed using a decision tree of a data mining technique based on the selected variable,
Wherein step (d) uses at least one decision tree algorithm among a CART algorithm that performs binary separation using a Gini index or a reduction in variance, and a CHAID algorithm that performs a chi-square test or an F-test.

The method according to claim 8,
Further comprising (e) determining an entry country of the file to be analyzed,
Wherein step (d) identifies, for each country, the harmfulness of the incoming file to be analyzed based on the variable selected in step (c) and the entry country determined in step (e).

The method according to claim 8,
Wherein step (a) dynamically identifies malicious code by analyzing the registry frequency, the calling process, and the calling result observed during execution of a normal file and a malicious file.

The method according to claim 8,
Wherein step (a) statically identifies malicious code using at least one of meta information, API analysis, and resource analysis.

The method according to claim 8,
Wherein step (c) compares the average ratios of the normal file and the virus file on the basis of the following expression and, when the value of theta (θ) is larger than a set value, determines that the variable has a significant influence on the target variable and selects it:

.

delete

The method according to claim 8,
Wherein, in step (d), when the CART algorithm is used and the target variable is divided into m categories with the probability of classification into the k-th category denoted P1, P2, ..., Pk, the following formula is defined:

.