KR101819322B1 - Malicious Code Analysis Module and Method therefor - Google Patents
Malicious Code Analysis Module and Method therefor
- Publication number
- KR101819322B1 KR1020160031262A
- Authority
- KR
- South Korea
- Prior art keywords
- file
- analysis
- malicious code
- variable
- analyzed
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/50—Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
- G06F21/55—Detecting local intrusion or implementing counter-measures
- G06F21/56—Computer malware detection or handling, e.g. anti-virus arrangements
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
-
- G06F17/30539—
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2216/00—Indexing scheme relating to additional aspects of information retrieval not explicitly covered by G06F16/00 and subgroups
- G06F2216/03—Data mining
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Computer Security & Cryptography (AREA)
- General Engineering & Computer Science (AREA)
- Software Systems (AREA)
- General Physics & Mathematics (AREA)
- Computer Hardware Design (AREA)
- Mathematical Physics (AREA)
- Data Mining & Analysis (AREA)
- General Health & Medical Sciences (AREA)
- Algebra (AREA)
- Computational Mathematics (AREA)
- Virology (AREA)
- Mathematical Analysis (AREA)
- Mathematical Optimization (AREA)
- Health & Medical Sciences (AREA)
- Pure & Applied Mathematics (AREA)
- Databases & Information Systems (AREA)
- Debugging And Monitoring (AREA)
Abstract
A malicious code analysis module and a malicious code analysis method thereof are disclosed. The malicious code analysis module according to the present invention includes: a dynamic analysis unit for performing a dynamic analysis on a file to be analyzed; a static analysis unit for performing a static analysis on the file to be analyzed; a database generation unit for generating a database by converting the results of the dynamic analysis unit and the static analysis unit into a binary data format; a variable selection unit for selecting, based on the generated database, a variable that has a significant influence on a target variable from among the items describing characteristics of the file to be analyzed; and a harmfulness discrimination unit for discriminating the harmfulness of the file to be analyzed using a decision tree of the data mining technique based on the selected variables.
Description
The present invention relates to a malicious code analysis module and a malicious code analysis method thereof, and more particularly, to a malicious code analysis module, and a method thereof, capable of identifying unknown malicious code by analyzing the characteristics of existing small-capacity executable files with a data mining technique.
In recent years, the development and diffusion of Internet technology have made a great contribution to the civilized world, which has created positive business effects, such as creating new business opportunities and utilizing resources more efficiently.
However, as the reliance on the Internet has increased, the abuse of the Internet has also increased rapidly, resulting in increased economic and psychological losses. Especially, malicious code that is newly produced or modified is used as a basic means of various application hacking and cyber security threats by bypassing the existing information protection system.
In order to suppress such abuse of Internet dependency, various researches are being conducted to identify new malicious codes which have not been known yet. However, there is little research to identify unknown malicious codes based on small-capacity executables of less than 1 Mbyte, such as EXE extensions, which account for a large portion of actual malicious codes.
The present invention has been made to address the situation described above, and an object of the present invention is to provide a malicious code analysis module, and a method thereof, capable of identifying unknown malicious code by analyzing the characteristics of known small-capacity executable files with a data mining technique.
According to an aspect of the present invention, there is provided a malicious code analysis module for analyzing malicious code, comprising: a dynamic analysis unit for performing a dynamic analysis on a file to be analyzed; a static analysis unit for performing a static analysis on the file to be analyzed; a database generation unit for generating a database by converting the results of the dynamic analysis unit and the static analysis unit into a binary data format; a variable selection unit for selecting, based on the generated database, a variable that has a significant influence on a target variable from among the items describing characteristics of the file to be analyzed; and a harmfulness discrimination unit for discriminating the harmfulness of the file to be analyzed using a decision tree of the data mining technique based on the selected variables.
The malicious code analysis module may further include an inflow country determination unit for determining the country from which the file to be analyzed flowed in. In this case, the harmfulness discrimination unit discriminates the harmfulness of files flowing in from each country based on the variables selected by the variable selection unit and the inflow country determined by the inflow country determination unit.
The dynamic analysis unit can dynamically identify the malicious code by analyzing the registry frequency, the calling process, and the calling result which are called during execution of the normal file and the malicious file.
In addition, the static analysis unit may statically identify malicious code using at least one of meta information, application programming interface (API) analysis, and resource analysis.
The variable selection unit compares the average ratio of the normal file and the virus file based on the following equation, and determines that the variable has a significant influence on the target variable when the value of θ is larger than the set value.
.
In addition, the harmfulness discrimination unit may use at least one decision tree algorithm, such as a Classification and Regression Trees (CART) algorithm that performs binary separation using the Gini index or the reduction in variance, or a Chi-squared Automatic Interaction Detection (CHAID) algorithm that performs separation using a chi-square test or an F-test.
In this case, when the CART algorithm is used and the target variable has m categories, with P1, P2, ..., Pm denoting the probabilities of being classified into each category, the following equation is defined.
.
According to an aspect of the present invention, there is provided a malicious code analysis method comprising: (a) performing dynamic analysis and static analysis on a file to be analyzed; (b) converting a result of step (a) into a binary data format to generate a database; (c) selecting, based on the database generated in step (b), a variable having a significant influence on a target variable from among the items describing characteristics of the file to be analyzed; and (d) identifying the harmfulness of the file to be analyzed by using a decision tree of a data mining technique based on the selected variables.
The malicious code analysis method may further include (e) determining the inflow country of the file to be analyzed. In this case, step (d) identifies the harmfulness of files flowing in from each country based on the variable selected in step (c) and the inflow country determined in step (e).
Here, the step (a) dynamically identifies the malicious code by analyzing the registry frequency, the calling process, and the calling result which are called during execution of the normal file and the malicious file.
The step (a) statically identifies the malicious code using at least one of meta information, API analysis, and resource analysis.
Step (c) compares the average ratio of the normal file with that of the virus file based on the following equation, and determines that the variable has a significant influence on the target variable when the value of θ is larger than the set value.
.
Step (d) may use at least one decision tree algorithm, such as a CART algorithm that performs binary separation using the Gini index or the reduction in variance, or a CHAID algorithm that performs separation using a chi-square test or an F-test.
In step (d), when the CART algorithm is used and the target variable has m categories, with P1, P2, ..., Pm denoting the probabilities of being classified into each category, the following equation is defined.
.
According to the present invention, by distinguishing the characteristics of existing small-capacity executable files and analyzing them using a data mining technique, it is possible to identify unknown malicious codes, thereby preventing the continually increasing inflow of malicious code and the resulting social, economic, and psychological damage to the public.
In addition, according to the present invention, it is possible to identify malicious code that flows into the country from abroad and to respond adaptively to the characteristics of the malicious code of each country, thereby preventing the damage that the malicious code may cause and enabling a quick response to it.
FIG. 1 is a diagram schematically showing a configuration of a malicious code analysis module according to an embodiment of the present invention.
FIG. 2 is a diagram for explaining dynamic analysis of a file to be analyzed, and is an example of registry root key information.
FIG. 3 is a diagram illustrating a dynamic analysis of a file to be analyzed, showing another example of registry root key information.
FIG. 4 is a diagram illustrating an example of the registry statistical information referred to in the analysis process of the analysis object file.
FIGS. 5 and 6 are diagrams showing examples of a registry identification table that summarizes the dynamic behavior of a file based on the statistical information and the registry action information of FIG.
FIG. 7 is a diagram showing an example of access statistics information of a system file to be generated when dynamic analysis of a file to be analyzed is executed.
FIGS. 8 and 9 are views showing examples of a system file generation action identification table summarized on the basis of the known file generation action information and the statistical information of FIG.
FIG. 10 is a diagram showing an example of a meta information identification table, which is shown for explaining a static analysis of a file to be analyzed.
FIG. 11 is a diagram showing an example of an Import Table and an Export Table from which API information is extracted from the executable file itself.
FIG. 12 is a diagram showing an example of API ratios distributed in the analysis object file.
FIGS. 13 to 16 are diagrams showing examples of the API identification table generated by classifying the statistical information and the API function-specific characteristics of FIG.
FIG. 17 is a diagram showing an example of statistical information on resource information extracted from a file to be analyzed.
FIG. 18 is a diagram showing an example of an identification table in which a resource name, a resource language, a resource sub-language, and a resource type are classified by referring to the statistical information in Fig.
FIG. 19 is a diagram showing an example of a change in the selected variables with an increase in the value of theta (θ).
FIGS. 20 to 24 are diagrams showing an example in which 134 fields, frequency, average, standard deviation, and theta value are identified for 34,646 files to be analyzed.
FIG. 25 is a diagram showing an example of a theta (θ) independent variable.
FIG. 26 is a flowchart illustrating a malicious code analysis method according to an embodiment of the present invention.
FIGS. 27 to 30 are views showing a normal classification table according to a decision tree technique.
FIG. 31 is a diagram showing a decision tree result classification table.
FIGS. 32 and 33 are diagrams illustrating an identification rule of normal / abnormal files.
Hereinafter, a malicious code analysis module and a malicious code analysis method according to an embodiment of the present invention will be described in detail with reference to the accompanying drawings.
FIG. 1 is a diagram schematically showing a configuration of a malicious code analysis module according to an embodiment of the present invention.
1, a malicious
The
The registry was introduced by Microsoft as a database that stores all configuration information of the Windows operating system. Windows registry information includes the configuration values and settings of the operating system, software information, hardware information, and user preferences. All operations of the Windows operating system are performed based on the information recorded in the registry, and the registry information changes continuously as the user changes control panel settings, file associations, and system policies. From this Windows registry information, malicious code can be identified by analyzing the registry frequency, the calling process, and the calling result observed during the execution of normal files and malicious files. In the embodiment of the present invention, the registry behavior characteristics of normal files and malicious files are converted into a database and analyzed to produce an identification table that can be used to determine whether a file is normal. The process is described below.
First, consider the registry structure. The registry uses a logical directory structure and consists of top-level root keys (RootKey) and various subkeys. The hive files that store the registry information are saved under the file names DEFAULT, SAM, SECURITY, SOFTWARE, and SYSTEM in the path "%SystemRoot%\System32\Config", and the registry information is periodically backed up. In Windows 2000 and later versions, five root keys are configured, as illustrated in FIG. There are also five data types of registry values used in the registry, as illustrated in FIG.
Next, consider the dynamic behavior of the registry. The HKEY_CURRENT_USER (HKCU) registry provides information such as login user information, MRU lists, drive MountPoints, RecentDocs, and TypedURLs. Analyzing this information makes it possible to identify the behavior characteristics of a file, and the result can be used as basic information for estimating the presence of malicious code. The hive file that stores the HKCU registry information is NTUSER.dat, and its storage path is "\Documents and Settings\account name" or "\Users\account name\".
1) Login user information
The operating system stores the name and connection time of the last logged-in user. If an account created by malicious code has been accessed, the access can be confirmed from this information.
* Path: HKCU\Software\Microsoft\Windows\CurrentVersion\Explorer
* String: Logon User Name
2) Autorun program information
Provides information on programs that start automatically upon login to the operating system. Malicious code sets this registry information to keep itself running.
* Path: HKCU\Software\Microsoft\Windows\CurrentVersion\Run
3) My Recent Documents
Provides the name and path of the files most recently viewed by the logged-in user. In the Windows operating system, the default location is [Start Menu] - [My Recent Documents]. It provides basic information such as recently opened document files, music files, and picture files.
* Path: HKCU\Software\Microsoft\Windows\CurrentVersion\Explorer\RecentDocs
* Data type: REG_BINARY
* Data: Number starting from 0, MRUListEx
4) URLs typed in Internet Explorer
Provides the URL information typed into the Internet Explorer web browser.
* Path: HKCU\Software\Microsoft\Internet Explorer\TypedURLs
5) Drive volume information
Provides volume information and autorun-related records of drives connected to the operating system.
* Path: HKCU\Software\Microsoft\Windows\CurrentVersion\Explorer\MountPoints2
6) Files opened by default programs
Provides lists of files recently opened by the default tools of the Windows operating system, such as Paint, WordPad, and the Registry Editor. For Paint, the more data values there are, the more files have been viewed recently.
* Path (Paint): HKCU\Software\Microsoft\Windows\CurrentVersion\Applets\Paint\RecentFileList
- Data type: REG_SZ
- Data: File1, File2, File3 ..
* Path (registry): HKCU\Software\Microsoft\Windows\CurrentVersion\regedit
- Data type: REG_SZ
- Data: LastKey
* Path (WordPad): HKCU\Software\Microsoft\Windows\CurrentVersion\Wordpad
* Path (SysTray): HKCU\Software\Microsoft\Windows\CurrentVersion\SysTray
* Path (Tour): HKCU\Software\Microsoft\Windows\CurrentVersion\Tour
7) Files accessed through a dialog box
Stores the list of files recently read or saved through a dialog box in the MRUList, in alphabetical order.
* Path: HKCU\Software\Microsoft\Windows\CurrentVersion\Explorer\ComDlg32\LastVisitedMRU
* Data types: REG_BINARY (a, b, c), REG_SZ (MRUList)
* Data: a, b, c, d ...
8) Commands executed in the Run window
Stores the history of the most recently executed commands in the MRUList data area, in alphabetical order. It can be used to confirm commands executed by malicious code.
* Path: HKCU\Software\Microsoft\Windows\CurrentVersion\Explorer\ComDlg32\LastVisitedMRU
* Data format: REG_SZ
* Data: a, b, c, d ... MRUList
9) Remote Desktop access information
The Windows operating system provides a Remote Desktop Connection service that allows remote access to the operating system by default. The Remote Desktop service is used to manage operational equipment remotely, but it should be deactivated because it can be used as an entry point for hacking. Similar remote access tools for the Windows operating system include Radmin and VNC.
* Path: HKCU\Software\Microsoft\TerminalServerClient\Default
* Data format: REG_SZ
* Data: MRUx, where x is a number starting with 0
10) Registry Favorites
Used when registering a favorite key location in the Registry Editor.
* Path: HKCU\Software\Microsoft\Windows\CurrentVersion\Applets\Regedit\Favorites
* Data format: REG_SZ
* Data: user-specified favorite name
11) Web browser connection program information
Provides information on the associated programs that can be launched from Internet Explorer, the web browser of the Windows operating system.
* Path: HKCU\Software\Microsoft\Windows\CurrentVersion\Explorer\FileExs
* Data type: REG_KEY
* Data: Extension name Key value
12) MUICache
The Multilingual User Interface Cache (MUICache) is a registry key that caches program names to support multiple languages.
* Path: HKCU\Software\Microsoft\Windows\CurrentVersion\Explorer\FileExs
* Data type: REG_KEY
* Data: Extension name Key value
13) UserAssist
UserAssist provides a list of executed programs, the number of executions, and the last execution time.
* Path: HKCU\Software\Microsoft\Windows\CurrentVersion\Explorer\UserAssist
14) Other registry information
The following is a list of other registry keys that may change when a file is executed on the Windows operating system.
* Path (SearchAssistant)
- HKCU\Software\Microsoft\SearchAssistant\ACMruControlPanel
* Path (ControlPanel)
- HKCU\Software\Microsoft\Windows\CurrentVersion\Explorer\ControlPanel
* Path (LAN Computer)
- HKCU\Software\Microsoft\Windows\CurrentVersion\Explorer\ComputersDescriptions
* Path (MMC Recent File List)
- HKCU\Software\Microsoft\Microsoft Management Console\Recent File List
* Path (MAP Network Drive MRU)
- HKCU\Software\Microsoft\Windows\CurrentVersion\Explorer\MAP Network Drive MRU
* Path (Media Player)
- HKCU\Software\Microsoft\MediaPlayer\Player\RecentFileList
The SOFTWARE file, a hive file of the HKEY_LOCAL_MACHINE registry, provides various information such as installed software information, Windows information, autorun programs, and uninstall information.
15) App Paths
Registry information that defines the execution path of each program. It serves a function similar to the PATH system environment variable of the Windows operating system. Malicious code can manipulate this registry information to induce the execution of an arbitrary program.
* Path: HKLM\SOFTWARE\Microsoft\Windows\CurrentVersion\App Paths
* Data: Program name.extension
16) Shell Open Command
Malicious code can change the Shell Open Command registry value to induce the execution of an arbitrary program.
* Path: HKLM\SOFTWARE\Classes\exefile\shell\open\command
* Data format: REG_SZ
* Data: "%1" %*
17) Autorun
Programs that run automatically upon login to the Windows operating system. The same settings can be viewed with the msconfig command provided in Windows. Malicious code can modify the path so that arbitrary files are executed at startup.
* Path: HKLM\SOFTWARE\Microsoft\Windows\CurrentVersion\Run
* Data format: REG_SZ
* Data: Execution path
18) WinLogon
Stores the logon settings of the Windows operating system. Malicious code can use it to run automatically at Windows logon.
* Path: HKLM\SOFTWARE\Microsoft\Windows\CurrentVersion\Winlogon
* Data format: REG_SZ
* Data: Execution path
19) BHO
A BHO (Browser Helper Object) is a feature provided to extend the functionality of Internet Explorer. By manipulating this value, malicious code can control web browser functions or expose key information.
* Path: HKLM\SOFTWARE\Microsoft\Windows\CurrentVersion\Explorer\Browser Helper Objects
* Data format: REG_SZ
* Data: Execution path
20) WinNT_CV
Provides the version information of the Windows operating system, registered user name, registered group name, service pack version, product name, license, and installation period.
* Path: HKLM\SOFTWARE\Microsoft\Windows\CurrentVersion\Winlogon
21) Other information
* Path (Recycle Bin information): HKLM\SOFTWARE\Microsoft\Windows\CurrentVersion\Explorer\BitBucket
* Path (network device name): HKLM\SOFTWARE\Microsoft\Windows NT\CurrentVersion\NetworkCards
* Path (AppInit_DLLs): HKLM\SOFTWARE\Microsoft\Windows\CurrentVersion\Windows
* Path (ImageFileExecution): HKLM\SOFTWARE\Microsoft\Windows\CurrentVersion\Image File Execution Options
* Path (Uninstall): HKLM\SOFTWARE\Microsoft\Windows\CurrentVersion\Uninstall
* Path (ProfileList): HKLM\SOFTWARE\Microsoft\Windows\CurrentVersion\ProfileList
The SYSTEM file, a hive file of the HKEY_LOCAL_MACHINE registry, provides various items such as installed software information, Windows information, autorun programs, and uninstall information.
22) Shutdown
Provides the time at which, and the number of times, the Windows operating system has been shut down.
* Path (ShutdownTime): HKLM\System\ControlSet001\Control\Windows
* Path (ShutdownCount): HKLM\System\ControlSet001\Control\Watchdog\Display
23) TimeZoneInformation
Provides the time information of the Windows operating system.
* Path: HKLM\System\ControlSet001\Control\Watchdog\Display
24) fDenyTSConnections
Indicates whether the Remote Desktop function of the Windows operating system is enabled. Malicious code can enable Remote Desktop for remote access.
* Path: HKLM\System\ControlSet001\Control\Terminal Server
25) MountedDevices
Provides the drive signature, device volume, and registry path device information associated with the operating system.
* Path: HKLM\System\ControlSet001\Enum\MountedDevices
26) Network Key
Provides network information of the operating system, such as the interface name and connection information (IP address, subnet mask, gateway).
* Path: HKLM\System\ControlSet001\Control\Network
HKLM\System\ControlSet001\Services\Tcpip\Parameters\Interfaces
27) Firewall settings
Provides the firewall configuration information of the Windows operating system (firewall activation, globally open ports, allowed programs and port information, etc.). Malicious code can change the firewall settings to prevent the firewall from blocking its malicious activity.
* Path: HKLM\System\ControlSet001\Services\SharedAccess\Parameters\FirewallPolicy\StandardProfile
28) USB storage device
Provides information on USB storage devices connected to the Windows operating system (USB device name, last connection time, serial number, etc.).
* Path: HKLM\System\ControlSet001\Enum\USBStor
29) DeviceClasses
Provides information (time information, device code), volume information (time information, ParentPrefixID), and the like for storage devices connected to the Windows operating system.
* Path: HKLM\System\ControlSet001\DeviceClasses
30) IDE
Provides connection information for IDE devices connected to the Windows operating system.
* Path: HKLM\System\ControlSet001\Enum\IDE
31) Services
Provides information on the drivers and services used in the Windows operating system (service name and path, service registration time, etc.).
* Path: HKLM\System\ControlSet001\Services
The SECURITY file, which stores HKEY_LOCAL_MACHINE registry information, provides the computer's audit security policy information, such as system events and login events in the Windows operating system. The audit policy registry value consists of a 64-byte string divided into 8 policies. If the data is set to 1, success items are audited; if it is set to 2, failure items are audited; and if it is set to 3, both success and failure are audited.
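The 1 / 2 / 3 encoding described above behaves like a pair of bit flags, which the following minimal sketch illustrates. The function name `decode_audit_value` is hypothetical, and treating 0 as "no auditing" is an assumption that the text does not state. The layout of the 64-byte value itself is not described here, so the sketch only decodes a single policy value.

```python
from typing import Tuple

AUDIT_SUCCESS = 1  # value 1: success items are audited
AUDIT_FAILURE = 2  # value 2: failure items are audited


def decode_audit_value(value: int) -> Tuple[bool, bool]:
    """Return (audit_success, audit_failure) for one audit policy value.

    Assumption: 0 means that neither success nor failure is audited.
    """
    if value not in (0, 1, 2, 3):
        raise ValueError(f"unexpected audit policy value: {value}")
    # 3 == AUDIT_SUCCESS | AUDIT_FAILURE, so both flags are set.
    return bool(value & AUDIT_SUCCESS), bool(value & AUDIT_FAILURE)
```

For example, `decode_audit_value(3)` reports that both success and failure are audited.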
Next, an example of an identification table of the registry dynamic behavior will be described.
Malicious code typically needs to perform actions such as registering itself for automatic execution, registering or deleting Windows firewall entries, enabling remote terminal access, and registering services. Observing such changes in the registry therefore yields basic data for classifying the characteristics of normal files and malicious files. FIG. 4 shows an example of the registry statistical information referenced while the file to be analyzed is executed. FIGS. 5 and 6 are diagrams showing examples of a registry identification table that summarizes the dynamic behavior of a file based on the statistical information and the registry action information of FIG. In the embodiment of the present invention, the identification tables shown in FIGS. 5 and 6 are used as basic data for classifying normal files and malicious files.
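As a rough illustration of how observed registry behavior could be turned into the 0/1 items of an identification table, the sketch below marks which watched keys a sample touched. The watchlist `WATCHED_KEYS`, the feature names, and the function `registry_features` are hypothetical. The key paths are taken from the registry list in this description, and the actual tables of FIGS. 5 and 6 are not reproduced here.

```python
from typing import Dict, Iterable

# Hypothetical watchlist: feature name -> registry key (paths from the list above).
WATCHED_KEYS: Dict[str, str] = {
    "autorun_hkcu": r"HKCU\Software\Microsoft\Windows\CurrentVersion\Run",
    "autorun_hklm": r"HKLM\SOFTWARE\Microsoft\Windows\CurrentVersion\Run",
    "firewall_policy": r"HKLM\System\ControlSet001\Services\SharedAccess\Parameters\FirewallPolicy",
    "terminal_server": r"HKLM\System\ControlSet001\Control\Terminal Server",
    "services": r"HKLM\System\ControlSet001\Services",
}


def registry_features(accessed_keys: Iterable[str]) -> Dict[str, int]:
    """Map the registry keys touched by a sample to binary (0/1) features."""
    accessed = [key.lower().rstrip("\\") for key in accessed_keys]
    features: Dict[str, int] = {}
    for name, watched in WATCHED_KEYS.items():
        prefix = watched.lower()
        # Match the key itself or any subkey, but not siblings such as "RunOnce".
        hit = any(k == prefix or k.startswith(prefix + "\\") for k in accessed)
        features[name] = int(hit)
    return features
```

Each sample then becomes one row of 0s and 1s, which matches the binary data format that the database generation step calls for.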
When malicious code is executed, it may download additional malicious programs or hide itself in the system folder or the Recycle Bin folder. It may also link a file to the startup program path or set up a scheduled task so that it is executed periodically and keeps running.
FIG. 7 is a diagram showing an example of system file access statistical information generated when dynamic analysis of a file to be analyzed is executed. In this case, the standard deviation is 27.71.
FIGS. 8 and 9 are views showing examples of a system file generation action identification table summarized on the basis of the known file generation action information and the statistical information of FIG. In the embodiment of the present invention, the identification information shown in Figs. 8 and 9 is used as basic information for distinguishing the characteristics of the normal file and the malicious file.
As described above, the
The
In general, the name, version information, product name, and similar attributes of a program are recorded as meta information when software is produced. Such attribute information can be used as basic information for identifying the purpose and use of the software, and the meta information may differ depending on the developer's development environment. FIG. 10 is a diagram showing an example of an identification table defined by examining the proportion of analysis target files that contain each item of meta information in the embodiment of the present invention. Here, the top 10 items of meta information account for the majority at 91.95%, and they are evenly distributed with a standard deviation of 0.03%.
An API (Application Programming Interface) is a predefined method provided by an operating system or programming language so that an application program can use system resources. The API provides interfaces for file control, window control, image processing, character control, and so on, and a program calls them internally to use system resources or to interact with other applications [28]. In the Windows operating system, the Windows API functions needed by an application program are provided by linking a Dynamic Link Library (DLL) file or by statically including them in the file itself. The Windows API operates in user mode and kernel mode, and the API operating in kernel mode is called the Native API. The behavior of a file can be characterized by extracting and comparing its API information. API information can be extracted either by analyzing the binary file itself or by hooking the APIs that are called while the program runs. The former analyzes the portable executable (PE) file, the executable format of the Windows operating system; the latter includes SSDT (System Service Descriptor Table) hooking and IDT (Interrupt Descriptor Table) hooking. SSDT hooking extracts API information by modifying the service table used in the kernel, and IDT hooking extracts the API by changing the interrupt processing path of the IDT.
In the embodiment of the present invention, the API information contained in the Import Address Table (IAT) of the binary is extracted using pefile, a PE file analysis module for Python. FIG. 11 is an example of an Import Table and an Export Table from which API information is extracted from the executable file itself, and FIG. 12 is an example of the API ratios distributed in the files to be analyzed. In the Import Table, APIs such as GetProcAddress and ExitProcess appear at a high rate.
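A minimal sketch of this extraction step, assuming the third-party pefile package is installed. The helper names `list_imports` and `api_ratio` are hypothetical and are not taken from the patent. The ratio is computed here as the share of files importing each API, which is one plausible reading of the API ratios of FIG. 12.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def list_imports(path: str) -> List[Tuple[str, str]]:
    """Return (dll, api) name pairs from the import table of a PE file."""
    import pefile  # third-party; imported lazily so api_ratio() works without it

    pe = pefile.PE(path)
    pairs: List[Tuple[str, str]] = []
    # DIRECTORY_ENTRY_IMPORT is absent when the file has no import table.
    for entry in getattr(pe, "DIRECTORY_ENTRY_IMPORT", []):
        dll = entry.dll.decode(errors="replace")
        for imp in entry.imports:
            if imp.name is not None:  # imports by ordinal carry no name
                pairs.append((dll, imp.name.decode(errors="replace")))
    return pairs


def api_ratio(per_file_apis: Iterable[Iterable[str]]) -> Dict[str, float]:
    """Fraction of files that import each API name."""
    files = [set(apis) for apis in per_file_apis]
    counts: Counter = Counter()
    for apis in files:
        counts.update(apis)
    return {api: n / len(files) for api, n in counts.items()} if files else {}
```

Calling `api_ratio` on the API names returned by `list_imports` for each sample gives a per-API share that can be compared between normal and malicious files.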
FIGS. 13 to 16 are diagrams showing examples of the API identification table generated by classifying the statistical information and the API function-specific characteristics of FIG. The API functions are grouped into file, registry, process, command execution, and network communication.
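The grouping can be pictured as a lookup from API name to group, producing one 0/1 flag per group. The mapping below is purely illustrative: the API names are real Windows functions, but which functions the tables of FIGS. 13 to 16 actually contain is not stated in the text, and `API_GROUPS` and `api_group_flags` are hypothetical names.

```python
from typing import Dict, Iterable

# Illustrative mapping only; not the patent's actual identification table.
API_GROUPS: Dict[str, str] = {
    "CreateFileW": "file",
    "WriteFile": "file",
    "RegCreateKeyExW": "registry",
    "RegSetValueExW": "registry",
    "CreateProcessW": "process",
    "WinExec": "command_execution",
    "ShellExecuteW": "command_execution",
    "InternetOpenUrlW": "network",
    "connect": "network",
}
GROUPS = ("file", "registry", "process", "command_execution", "network")


def api_group_flags(imported: Iterable[str]) -> Dict[str, int]:
    """Return a 0/1 flag per API group for the imported API names."""
    present = {API_GROUPS[name] for name in imported if name in API_GROUPS}
    return {group: int(group in present) for group in GROUPS}
```

API names that are not in the mapping, such as GetProcAddress in this sketch, simply do not set any flag.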
In the structure of a PE (Portable Executable) file, IMAGE_NT_HEADER contains information such as the number of sections, time information, and execution attributes. SECTION TABLE holds code, data, resources, and debug information. A resource includes information such as a resource name, a resource language, and a resource type. Such information can be used as basic information for classifying the characteristics of a file. FIG. 17 is a diagram showing an example of statistical information on resource information extracted from a file to be analyzed, and FIG. 18 is an example of an identification table in which a resource name, a resource language, a resource sub-language, and a resource type are classified.
In the embodiment of the present invention, a static analysis identification table is produced based on the static information of the file to be analyzed.
The influent
The
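The variable selecting unit compares the average ratio of the normal file to the virus file on the basis of the following Equation 1 and, when the value of theta (θ) is larger than a set value, determines that the variable has a significant influence on the target variable.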
[Equation 1]
.
In this way, independent variables that have a significant effect on the target variable can be checked and selected before the decision tree is run. In the embodiment of the present invention, three theta (θ) values are selected based on how much the number of independent variables changes as theta (θ) increases. Specifically, 119 variables (88.81%) were selected as independent variables when the theta (θ) value was 0, 63 variables (47.01%) when the value was 1.3, and 12 variables (8.96%) at the highest of the three values. FIG. 19 shows an example of the change in the selected variables as the theta (θ) value increases.
FIGS. 20 to 24 are diagrams showing an example of identifying 134 fields, with the frequency, average, standard deviation, and theta (θ) value of each, for the 34,646 files to be analyzed. Here, a frequency item of 1 means the item is present and 0 means it is not. To select the independent variables that support the target variable, variables were selected when the value of theta (θ) was equal to or more than 1.3, or equal to or more than 2.5. A total of 72 variables were selected from static analysis items such as metadata (basic information), packer inspection, virtual machine detection techniques, API analysis (anti-debugging), and resource analysis, and a total of 47 variables were selected from items including system file creation, registry execution, and program execution. The values of all the independent variables are binary data of 0 and 1, and details are shown in the drawings.
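The threshold-based selection can be sketched as below. Note that Equation 1 is not reproduced in this text, so the statistic here is a stand-in chosen for illustration only: the absolute difference between the virus and normal means of a 0/1 variable, divided by the pooled standard deviation. The function names, variable names, and data are all hypothetical.

```python
from statistics import mean, pstdev


def theta_stand_in(normal_values, virus_values):
    """Stand-in separation score; NOT the patent's Equation 1."""
    spread = pstdev(list(normal_values) + list(virus_values))
    if spread == 0:
        return 0.0
    return abs(mean(virus_values) - mean(normal_values)) / spread


def select_variables(normal_rows, virus_rows, names, threshold):
    """Keep the variables whose score is equal to or more than the threshold."""
    selected = []
    for i, name in enumerate(names):
        score = theta_stand_in([r[i] for r in normal_rows], [r[i] for r in virus_rows])
        if score >= threshold:
            selected.append(name)
    return selected


# Hypothetical binary feature rows (1 = item present, 0 = absent).
names = ["packed", "has_icon", "writes_registry"]
normal_rows = [[0, 1, 0], [0, 1, 0], [0, 0, 0], [1, 1, 0]]
virus_rows = [[1, 1, 1], [1, 0, 1], [1, 1, 1], [1, 1, 0]]

at_zero = select_variables(normal_rows, virus_rows, names, 0.0)
at_1_3 = select_variables(normal_rows, virus_rows, names, 1.3)
```

Raising the threshold shrinks the selected set, which mirrors the drop from 119 to 63 to 12 variables reported for the three theta values.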
The
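In this case, when the CART algorithm is used and the target variable has m categories, with P1, P2, ..., Pm denoting the probabilities that a case is classified into each category, the Gini index G of a node is defined as in Equation 2.
[Equation 2]
$$G = 1 - \sum_{k=1}^{m} P_k^{2}$$
CART selects, as the split that forms the child nodes, the independent variable and its optimal separation that reduce the Gini index the most, that is, that maximize the reduction in Equation 3.
[Equation 3]
$$\Delta G = G(t) - P_L\,G(t_L) - P_R\,G(t_R)$$
Here t is the parent node, t_L and t_R are the left and right child nodes, and P_L and P_R are the proportions of the cases in t that are sent to each child node. In other words, the child nodes are formed by the independent variable and optimal separation that maximally reduce impurity. This is equivalent to minimizing the weighted sum of the impurity of the child nodes, as shown in Equation 4.
[Equation 4]
$$\min \; P_L\,G(t_L) + P_R\,G(t_R)$$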
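The split selection described above can be sketched with the textbook form of the Gini index, one minus the sum of squared category proportions. The sketch assumes binary (0/1) independent variables, as stated for this embodiment, and picks the variable whose split minimizes the weighted impurity of the two child nodes. The data and function names are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared category proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def weighted_child_gini(rows, labels, feature):
    """Weighted sum of child impurities after splitting on a 0/1 feature."""
    left = [y for x, y in zip(rows, labels) if x[feature] == 0]
    right = [y for x, y in zip(rows, labels) if x[feature] == 1]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


def best_split(rows, labels):
    """Index of the feature whose split minimizes the weighted child impurity."""
    return min(range(len(rows[0])), key=lambda f: weighted_child_gini(rows, labels, f))


# Hypothetical data: feature 0 separates the classes perfectly, feature 1 does not.
rows = [[1, 0], [1, 1], [1, 0], [0, 1], [0, 0], [0, 1]]
labels = ["virus", "virus", "virus", "normal", "normal", "normal"]

parent_impurity = gini(labels)
best = best_split(rows, labels)
```

Because the parent impurity is fixed for a given node, minimizing the weighted child impurity selects the same split as maximizing the impurity reduction.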
In the examples of the present invention, the presence or absence of a virus was examined using 43 publicly available antivirus (vaccine) programs. A file that is a known virus can be identified easily, but a relatively safe file can also be flagged as a virus because antivirus programs differ in their file inspection criteria. Therefore, additional study is needed to determine an appropriate criterion for how many antivirus programs must flag a file before it is treated as a virus. In the examples of the present invention, experiments were conducted with the required number of detecting antivirus programs set to 1 or more, 5 or more, 20 or more, and 40 or more. These criteria can be changed according to the strictness of the virus identification security policy of an enterprise or organization.
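The effect of the detection-count criterion on the target variable can be sketched as follows. The thresholds come from the text; the detection counts and function names are hypothetical.

```python
# Criteria from the text: a file counts as a virus when at least this many
# antivirus programs flag it.
THRESHOLDS = (1, 5, 20, 40)


def label_file(detections, threshold):
    """Label one file from its number of antivirus detections."""
    return "virus" if detections >= threshold else "normal"


def label_all(detection_counts, threshold):
    """Label every file under a single criterion."""
    return [label_file(d, threshold) for d in detection_counts]


# Hypothetical detection counts for five files, out of 43 antivirus programs.
counts = [0, 3, 7, 25, 43]
labels_by_threshold = {t: label_all(counts, t) for t in THRESHOLDS}
```

A stricter policy corresponds to a lower threshold, so more files are labeled as viruses.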
FIG. 26 is a flowchart illustrating a malicious code analysis method according to an embodiment of the present invention. The malicious code analysis method according to the embodiment of the present invention can be performed by the malicious code analysis module.
1 to 26, the malicious
The malicious
The malicious
In step S140, the malicious
In this way, independent variables that have a significant effect on the target variable can be checked and selected before the decision tree is run. In the embodiment of the present invention, three theta (θ) values are selected based on how much the number of independent variables changes as theta (θ) increases. Specifically, 119 variables (88.81%) were selected as independent variables when the theta (θ) value was 0, 63 variables (47.01%) when the value was 1.3, and 12 variables (8.96%) at the highest of the three values.
The malicious
In the embodiment of the present invention, a decision tree analysis was performed on the 34,646 items (100%) of analysis data described above in order to develop a harmfulness prediction model for files suspected of containing malicious code. Of the whole analysis data, 24,252 items (70%) were used as learning data and 10,394 items (30%) were used as prediction (verification) data. In addition, the criterion for the number of antivirus (vaccine) detections used to identify a virus was divided into four categories (1 or more, 5 or more, 20 or more, and 40 or more), and each category was analyzed for each theta (θ) value. For the decision tree, the ctree_control function in R-Studio was configured with a mincriterion value of 0.99, a minsplit value of 1, and a maxdepth value of 10.
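The 70/30 partition reported above can be reproduced in outline with a shuffled index split. This sketch only illustrates the partition sizes; the seed, the function name, and the use of a random shuffle are assumptions, and the ctree_control settings belong to the R workflow and are not modeled here.

```python
import random


def split_indices(n, train_fraction=0.7, seed=0):
    """Shuffle the row indices and cut them into learning and verification sets."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # fixed seed (assumed) for repeatability
    cut = int(n * train_fraction)
    return indices[:cut], indices[cut:]


# 34,646 analyzed files, as reported in the text.
train_idx, test_idx = split_indices(34646)
```

With 34,646 rows this yields 24,252 learning rows and 10,394 verification rows, matching the counts in the text.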
As a result of the analysis, the model that classified a file as malicious when five or more antivirus (vaccine) programs detected it and as a normal file when fewer than five did, with a theta (θ) value of 0 and 119 independent variables, showed a low standard deviation. It was also confirmed that accuracy decreases as the value of theta (θ) increases and the number of independent variables decreases. Details are as shown in FIGS. 27 to 30.
With reference to FIG. 31, the verification result when the malicious code identification criterion is defined as 5 or more detections and the theta (θ) value is 0 is described in detail as follows. On the learning data (70% of the total analysis data), the classification accuracy was 84.56% and the misclassification rate was 15.44%. The specificity, the rate at which a normal file is predicted as a normal file, was 81.03%, and the sensitivity, the rate at which an actual virus is predicted as a virus, was 87.05%. When the classification model derived from the learning data was applied to the verification data (30% of the total analysis data), the classification accuracy was 84.42% and the misclassification rate was 15.58%.
In addition, on the verification data the specificity of predicting a normal file as a normal file was 80.39% and the sensitivity of predicting an actual virus as a virus was 87.20%, which was not significantly different from the learning data. FIG. 31 is a decision tree classification table used to identify the harmfulness of a file suspected of being malicious. In the actual classification process the maximum tree depth is set to 10, but the figure limits the maximum tree depth to 4 because a deeper classification table would be difficult to read. The top-level separation criterion of the prediction model was the static analysis item manufacturer name (ME06), followed by dynamic analysis items such as registry office access (R22) and multimedia access (RC25).
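The four reported measures follow from a confusion matrix in the usual way, with "virus" as the positive class. The sketch below shows the arithmetic; the counts are hypothetical round numbers chosen for illustration and are not the patent's data.

```python
def classification_metrics(tp, fn, tn, fp):
    """Accuracy, misclassification rate, sensitivity, and specificity.

    tp: viruses predicted as virus      fn: viruses predicted as normal
    tn: normal files predicted normal   fp: normal files predicted as virus
    """
    total = tp + fn + tn + fp
    accuracy = (tp + tn) / total
    return {
        "accuracy": accuracy,
        "misclassification_rate": 1 - accuracy,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Hypothetical counts: 100 actual viruses and 100 actual normal files.
metrics = classification_metrics(tp=87, fn=13, tn=81, fp=19)
```

Accuracy and the misclassification rate always sum to 1, which is consistent with the reported pairs 84.56% and 15.44%, and 84.42% and 15.58%.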
The rules for distinguishing a normal file from a virus in FIG. 31 are shown in FIG. 32 and FIG. 33, arranged as identification rules centered on the leaf nodes. A file that satisfies a rule is classified as a normal file or a virus accordingly, and the identification criterion items change according to the separation stopping rule.
In the embodiment of the present invention, a decision tree model capable of distinguishing malicious code from normal files was built. Among the decision models generated by classifying normal and malicious files according to the number of antivirus (vaccine) detections, the selected model had a low standard deviation (3.02) and high accuracy (84.56%), sensitivity (87.05%), and specificity (81.03%).
Claims (14)
A dynamic analysis unit for performing dynamic analysis on a file to be analyzed;
A static analyzer for performing a static analysis on the file to be analyzed;
A database generation unit for generating a database by converting the results of the dynamic analysis unit and the static analysis unit into a binary data format;
A variable selection unit for selecting a variable that has a significant influence on a target variable among the items describing characteristics of the file to be analyzed, based on the generated database; and
A harmfulness identification unit for identifying the harmfulness of the file to be analyzed using a decision tree of a data mining technique based on the selected variable,
Wherein the harmfulness identification unit
uses at least one decision tree algorithm among a Classification and Regression Trees (CART) algorithm that performs binary separation using the Gini index or the reduction in variance, and a CHAID (Chi-squared Automatic Interaction Detection) algorithm that performs a chi-square test or an F-test.
An entry country determination unit for determining the entry country of the file to be analyzed;
Further comprising:
Wherein the harmfulness identification unit identifies a harmfulness of a file to be analyzed, which is inputted for each country based on a variable selected by the variable selection unit and an entry country determined by the entry country determination unit.
Wherein the dynamic analysis unit comprises:
And the malicious code is dynamically identified by analyzing the registry frequency, the calling process, and the calling result which are called during execution of the normal file and the malicious file.
Wherein the static analysis unit comprises:
Wherein the malicious code is statically identified using at least one of meta information, application programming interface (API) analysis, and resource analysis.
Wherein the variable selection unit
compares the average ratio of the normal file to the virus file on the basis of the following expression and, when the value of θ is larger than a set value, determines that the variable has a significant influence on the target variable:
.
Wherein, in the harmfulness identification unit,
when the CART algorithm is used, the target variable is divided into m categories, and the probabilities of being classified into each category are P1, P2, ..., and Pk, the following formula is defined:
.
(a) performing a dynamic analysis and a static analysis on a file to be analyzed;
(b) generating a database by converting the result of the step (a) into a binary data format;
(c) selecting a variable having a significant influence on a target variable in an item describing characteristics of the analysis object file based on the database generated by the step (b); And
(d) identifying a harmfulness of a file to be analyzed using a decision tree of a data mining technique based on a set variable,
Wherein the step (d) uses at least one decision tree algorithm among a CART algorithm that performs binary separation using the Gini index or the reduction in variance, and a CHAID algorithm that performs a chi-square test or an F-test, in a malicious code analysis method.
(e) determining an entry country of the analysis target file;
Further comprising:
Wherein the step (d) identifies the harmfulness of a file to be analyzed, which is imported for each country, based on the variable selected in the step (c) and the entry country determined in the step (e).
The step (a)
Wherein the malicious code is dynamically identified by analyzing a registry frequency, a calling process, and a calling result which are called during the execution of the normal file and the malicious file.
The step (a)
Wherein the malicious code is statically identified using at least one of meta information, API analysis, and resource analysis.
The step (c)
Wherein the average ratio of the normal file to the virus file is compared on the basis of the following expression and, when the value of θ is larger than a set value, the variable is determined to have a significant influence on the target variable:
.
The step (d)
When the CART algorithm is used, the target variable is divided into m categories, and the probabilities of being classified into each category are P1, P2, ..., and Pk, the following formula is defined:
.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
KR1020160031262A KR101819322B1 (en) | 2016-03-16 | 2016-03-16 | Malicious Code Analysis Module and Method therefor |
Publications (2)
Publication Number | Publication Date |
---|---|
KR20170107665A (en) | 2017-09-26 |
KR101819322B1 (en) | 2018-02-28 |
Family
ID=60037036
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
KR1020160031262A KR101819322B1 (en) | 2016-03-16 | 2016-03-16 | Malicious Code Analysis Module and Method therefor |
Country Status (1)
Country | Link |
---|---|
KR (1) | KR101819322B1 (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR20200057402A (en) | 2018-11-16 | 2020-05-26 | 주식회사 베일리테크 | System and method for detecting malignant code based on virtual and real machine |
KR20210155214A (en) * | 2020-06-15 | 2021-12-22 | 한양대학교 산학협력단 | Apparatus and method for detecting malicious code using tracing based on hardware and software |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR101988747B1 (en) * | 2017-11-30 | 2019-06-12 | 건국대학교 산학협력단 | Ransomware dectecting method and apparatus based on machine learning through hybrid analysis |
US20200117802A1 (en) | 2018-10-15 | 2020-04-16 | Mcafee, Llc | Systems, methods, and media for identifying and responding to malicious files having similar features |
KR102562215B1 (en) * | 2021-11-26 | 2023-07-31 | 서울여자대학교 산학협력단 | Time-driven evasive malware detection method and device |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR101428004B1 (en) * | 2014-04-04 | 2014-08-11 | (주)지란지교소프트 | Method and device for detecting malicious processes outflow data |
JP2016031629A (en) * | 2014-07-29 | 2016-03-07 | 日本電信電話株式会社 | Feature selection device, feature selection system, feature selection method and feature selection program |
2016-03-16: KR application KR1020160031262A, patent KR101819322B1 (en), active (IP Right Grant)
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR20200057402A (en) | 2018-11-16 | 2020-05-26 | 주식회사 베일리테크 | System and method for detecting malignant code based on virtual and real machine |
KR20210155214A (en) * | 2020-06-15 | 2021-12-22 | 한양대학교 산학협력단 | Apparatus and method for detecting malicious code using tracing based on hardware and software |
KR102421394B1 (en) * | 2020-06-15 | 2022-07-15 | 한양대학교 산학협력단 | Apparatus and method for detecting malicious code using tracing based on hardware and software |
Also Published As
Publication number | Publication date |
---|---|
KR20170107665A (en) | 2017-09-26 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
A201 | Request for examination | ||
E902 | Notification of reason for refusal | ||
E701 | Decision to grant or registration of patent right | ||
GRNT | Written decision to grant |