WO2023158071A1

WO2023158071A1 - Malicious code infection detecting system and method through network communication log analysis

Info

Publication number: WO2023158071A1
Application number: PCT/KR2022/019311
Authority: WO
Inventors: 한승연; 김유태
Original assignee: 주식회사 리니어리티
Priority date: 2022-02-21
Filing date: 2022-12-01
Publication date: 2023-08-24
Also published as: KR20230125683A; KR102623681B1

Abstract

According to a malicious code infection detecting system and method through a network communication log analysis, presented in the present invention, log data collected from a network device is analyzed so as to identify periodic access, to an attack command server, of user devices infected with malicious code, and thus a host infected with malicious code in user devices that periodically attempt to access an external network can be quickly and accurately identified.

Description

Malicious code infection detection system and method through network communication log analysis

The present invention relates to a system and method for detecting a malicious code infection, and more particularly, to a system and method for detecting a malicious code infection through network communication log analysis.

In general, malicious code is detected by matching a specific code section of the malicious code. That is, conventional malicious code detection methods (e.g., antivirus) detect malicious code by applying pattern information for a specific code section to a file suspected of being malicious code.

In particular, conventional antivirus detection methods detect malicious code based on the bytes of a specific code section used by the malicious code, or determine whether a file is malicious by measuring a risk level from the various log information and file structure information generated while the malicious code operates and from the presence or absence of suspicious DLLs and API functions. However, these methods have difficulty detecting new and variant malicious code.

In addition, in the case of a general virtual machine (VM)-based malicious file analysis system, when the malicious code determines, through various anti-virtual machine (anti-VM) and anti-debugging technologies, that it is being executed inside an analysis system, the malicious code terminates itself or keeps running without exhibiting any malicious behavior. Because the malicious code therefore cannot be analyzed in the analysis system, analysis may be difficult.

Recently, research on technology for detecting malicious codes using artificial intelligence technologies such as deep learning has been actively attempted. As a related prior art, Patent Registration No. 10-1880686 (Title of Invention: Malicious Code Detection System Based on AI Deep Learning, Registration Date: July 16, 2018) has been disclosed.

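As such, attacks using malicious code have become common, and various technologies for detecting malicious code are being developed. However, as described above, these technologies have difficulty detecting new and variant malicious code and malicious code that evades analysis. Therefore, it is necessary to develop a technology that can solve this problem.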

The present invention is proposed to solve the above problems of previously proposed methods. An object of the present invention is to provide a system and method for detecting malicious code infection through network communication log analysis which, by analyzing log data collected from network devices and identifying that user devices infected with malicious code periodically access an attack command server, can quickly and accurately identify a host infected with malicious code among user devices attempting to access an external network.

To achieve the above object, a malicious code infection detection system through network communication log analysis according to the features of the present invention is

a malicious code infection detection system that

identifies user devices infected with malicious code by recognizing that such devices periodically access an attack command server (C&C server, Command and Control Server), and includes:

a log collection module that collects log data from a network device connected to the user device;

a feature extraction module that extracts characteristic information consisting of a communication date and time, a source IP, and a destination domain from the log data collected by the log collection module; and

an analysis module that analyzes the characteristic information extracted by the feature extraction module and identifies the IP of the user device estimated to have accessed the attack command server.

Preferably, the log collection module

may collect the log data from at least one network device among a firewall whose log data includes a source IP and a destination IP, a web proxy whose log data includes a source IP and an access URL, and a packet forensic solution, and

the destination domain constituting the characteristic information may be the destination IP or the access URL.

Preferably, the feature extraction module may extract and refine the characteristic information from the log data by performing the steps of:

(2-1) querying the log data for a preset period of time;

(2-2) extracting the characteristic information from the queried log data, and separating and storing the characteristic information in units of a reference time; and

(2-3) refining the characteristic information separated and stored in units of the reference time by removing duplicates.

More preferably, the analysis module may identify user devices infected with malicious code by performing the steps of:

(3-1) calculating, for each destination domain, the total number of connections to the destination domain and the number of source IPs from the characteristic information stored in units of the reference time;

(3-2) counting the number of days on which each source IP accessed the destination domain within the preset period, and identifying the source IPs that accessed the destination domain for more than a threshold number of days; and

(3-3) calculating a risk level using the results calculated in steps (3-1) and (3-2), determining whether the access is malicious, and identifying the source IP determined to be malicious.

Even more preferably, the risk level

may be calculated for each destination domain as the product of (i) the number of access days to the destination domain within the set period divided by the number of source IPs accessing the destination domain within the set period, and (ii) the average total number of connections to all destination domains divided by the average number of access days to all destination domains within the set period.

To achieve the above object, a malicious code infection detection method through network communication log analysis according to the features of the present invention is

a malicious code infection detection method in which each step is performed in a malicious code infection detection system, the method including:

(1) collecting log data from a network device connected to the user device;

(2) extracting characteristic information consisting of a communication date and time, a source IP, and a destination domain from the log data collected in step (1); and

(3) analyzing the characteristic information extracted in step (2) to identify the IP of the user device that is presumed to have accessed the attack command server.

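Preferably, step (2) may include:

(2-1) querying the log data for a preset period of time;

(2-2) extracting the characteristic information from the queried log data, and separating and storing the characteristic information in units of a reference time; and

(2-3) refining the characteristic information separated and stored in units of the reference time by removing duplicates.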

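More preferably, step (3) may include:

(3-1) calculating, for each destination domain, the total number of connections to the destination domain and the number of source IPs from the characteristic information stored in units of the reference time;

(3-2) counting the number of days on which each source IP accessed the destination domain within the preset period, and identifying the source IPs that accessed the destination domain for more than a threshold number of days; and

(3-3) calculating a risk level using the results calculated in steps (3-1) and (3-2), determining whether the access is malicious, and identifying the source IP determined to be malicious.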

According to the system and method for detecting malicious code infection through network communication log analysis proposed in the present invention, log data collected from network devices is analyzed to identify that user devices infected with malicious code periodically access an attack command server, so that hosts infected with malicious code can be quickly and accurately identified among user devices that periodically attempt to access external networks.

FIG. 1 is a diagram showing the configuration of a malicious code infection detection system through network communication log analysis according to an embodiment of the present invention.

FIG. 2 is a flowchart illustrating a method for detecting a malicious code infection through network communication log analysis according to an embodiment of the present invention.

FIG. 3 is a diagram showing an example of log data collected in step S100 of the method for detecting a malicious code infection through network communication log analysis according to an embodiment of the present invention.

FIG. 4 is a diagram showing a detailed flow of step S200 in the method for detecting malicious code infection through network communication log analysis according to an embodiment of the present invention.

FIG. 5 is a diagram showing an example of characteristic information separated on a daily basis in step S220 of the method for detecting malicious code infection through network communication log analysis according to an embodiment of the present invention.

FIG. 6 is a diagram showing an example of characteristic information refined by removing duplicates in step S230 of the method for detecting malicious code infection through network communication log analysis according to an embodiment of the present invention.

FIG. 7 is a diagram showing a detailed flow of step S300 in the method for detecting malicious code infection through network communication log analysis according to an embodiment of the present invention.

FIG. 8 is a diagram showing an example of the total number of connections to the destination domain and the number of source IPs calculated in step S310 of the method for detecting malicious code infection through network communication log analysis according to an embodiment of the present invention.

FIG. 9 is a diagram showing an example of the number of access days counted in step S320 of the method for detecting malicious code infection through network communication log analysis according to an embodiment of the present invention.

100: Malicious code infection detection system

110: log collection module

120: feature extraction module

130: analysis module

S100: Collecting log data from network devices

S200: Extracting characteristic information from log data

S210: Querying log data for a preset period

S220: Extracting characteristic information from the queried log data, and separating and storing it in units of a reference time

S230: Refining the characteristic information separated and stored in units of the reference time by removing duplicates

S300: Identifying the IP of the user device that is presumed to have accessed the attack command server

S310: Calculating, for each destination domain, the total number of connections to the destination domain and the number of source IPs from the characteristic information stored in units of the reference time

S320: Counting the number of days on which each source IP accessed the destination domain within the preset period, and identifying source IPs that accessed the destination domain for more than a threshold number of days

S330: Calculating the risk level, determining whether the access is malicious, and identifying the source IP determined to be malicious

Hereinafter, preferred embodiments will be described in detail so that those skilled in the art can easily practice the present invention with reference to the accompanying drawings. However, in describing a preferred embodiment of the present invention in detail, if it is determined that a detailed description of a related known function or configuration may unnecessarily obscure the gist of the present invention, the detailed description will be omitted. In addition, the same reference numerals are used throughout the drawings for parts having similar functions and actions.

In addition, throughout the specification, when a part is said to be 'connected' to another part, this includes not only the case where it is 'directly connected' but also the case where it is 'indirectly connected' with another element in between. In addition, 'including' a certain component means that other components may be further included, rather than excluded, unless otherwise specified.

FIG. 1 is a diagram showing the configuration of a malicious code infection detection system 100 through network communication log analysis according to an embodiment of the present invention. As shown in FIG. 1, the malicious code infection detection system 100 through network communication log analysis according to an embodiment of the present invention identifies a user device infected with malicious code by recognizing that such a device periodically accesses an attack command server (C&C server, C2 server, Command and Control Server), and may include: a log collection module 110 that collects log data from a network device connected to the user device; a feature extraction module 120 that extracts characteristic information consisting of a communication date and time, a source IP, and a destination domain from the log data collected by the log collection module 110; and an analysis module 130 that analyzes the characteristic information extracted by the feature extraction module 120 and identifies the IP of the user device estimated to have accessed the attack command server.

To address the difficulty of detection caused by recent malware variants and detection bypass techniques, there have been attempts to detect malware infection by analyzing network logs such as firewall logs, based on the fact that PCs infected with malware periodically access the C2 server. However, the traffic logs of normal users are large, reaching 10 GB or more per day, so analysis takes a long time, and legitimate programs such as Windows Update also periodically access update servers.

The present invention focuses on the following facts: (1) when a user PC (user device) is infected with malicious code, it periodically attempts to access the C2 server to notify the attacker of the infection or to transmit information stored in the user device to the attacker; (2) in a corporate environment, communication between internal and external networks is generally recorded in the form of logs on network equipment such as firewalls; and (3) only a small number of PCs infected with malicious code access the C2 server, unlike general sites such as Internet portals, which many users access. By applying an analysis technique developed in-house to network log data collected from network devices such as firewalls, hosts infected with malicious code can be quickly and accurately identified among user PCs that periodically attempt to access external networks.

FIG. 2 is a flowchart illustrating a method for detecting a malicious code infection through network communication log analysis according to an embodiment of the present invention. As shown in FIG. 2, the method for detecting malicious code infection through network communication log analysis according to an embodiment of the present invention is a method in which each step is performed in the malicious code infection detection system 100 and which identifies user devices infected with malicious code by recognizing that such devices periodically access an attack command server (C&C server, Command and Control Server). The method can be implemented including collecting log data from network devices (S100), extracting characteristic information from the log data (S200), and identifying the IP of a user device that is estimated to have accessed the attack command server (S300).

In step S100, the log collection module 110 may collect log data from a network device connected to the user device. More specifically, the log collection module 110 may collect log data from at least one network device among a firewall whose log data includes a source IP and a destination IP, a web proxy whose log data includes a source IP and an access URL, and a packet forensic solution.

FIG. 3 is a diagram showing an example of log data collected in step S100 of the malicious code infection detection method through network communication log analysis according to an embodiment of the present invention. As shown in FIG. 3, in step S100 of the malicious code infection detection method through network communication log analysis according to an embodiment of the present invention, the log collection module 110 collects the log data necessary for analysis from the network device. The log data may include information such as the source IP, the access date and time, and the destination domain (destination IP or access URL).

In step S200, the feature extraction module 120 may extract feature information consisting of a communication date and time, a source IP, and a destination domain from the log data collected in step S100. Here, the destination domain constituting the characteristic information may be a destination IP or access URL.
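The characteristic information described for step S200 can be sketched as a small record type with the three fields named in the text. The source does not specify a log schema, so the `Feature` type, the `from_firewall` and `from_proxy` helpers, and the dictionary keys (`time`, `src_ip`, `dst_ip`, `url`) are hypothetical names chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Feature:
    """Characteristic information: communication date and time, source IP, destination domain."""
    timestamp: datetime
    src_ip: str
    dst_domain: str


def from_firewall(row: dict) -> Feature:
    # Assumed firewall log fields: ISO-8601 "time", "src_ip", "dst_ip".
    # For firewall logs the destination IP serves as the destination domain.
    return Feature(datetime.fromisoformat(row["time"]), row["src_ip"], row["dst_ip"])


def from_proxy(row: dict) -> Feature:
    # Assumed web proxy log fields: ISO-8601 "time", "src_ip", "url".
    # For proxy logs the access URL serves as the destination domain.
    return Feature(datetime.fromisoformat(row["time"]), row["src_ip"], row["url"])
```

Mapping both log sources onto one record type lets the later steps treat a destination IP and an access URL uniformly, as the text describes.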

FIG. 4 is a diagram showing a detailed flow of step S200 in the method for detecting malicious code infection through network communication log analysis according to an embodiment of the present invention. As shown in FIG. 4, step S200 of the malicious code infection detection method through network communication log analysis according to an embodiment of the present invention is performed in the feature extraction module 120 and may be implemented including querying log data for a preset period (S210), extracting characteristic information from the queried log data and separating and storing it in units of a reference time (S220), and refining the characteristic information stored in units of the reference time by removing duplicates (S230).

In step S210, log data can be inquired for a preset period. In step S210, real-time monitoring is enabled by inquiring log data of a preset period from the most recent date and time, and the preset period may be a recent period of one week or more.
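The query in step S210 can be illustrated as a filter over a time window ending at the most recent date and time. This is a minimal sketch, assuming the log data is already in memory as `(timestamp, src_ip, destination)` tuples; the function name `query_recent` and the `period_days` parameter are hypothetical, and a real deployment would more likely issue the query against a log store.

```python
from datetime import datetime, timedelta


def query_recent(records, now, period_days=7):
    """Return records whose timestamp falls within the last `period_days` before `now`."""
    start = now - timedelta(days=period_days)
    # Each record is assumed to be a (timestamp, src_ip, destination) tuple.
    return [r for r in records if start <= r[0] <= now]
```

Calling the function repeatedly with the current time as `now` gives the sliding window that the text associates with real-time monitoring.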

In step S220, characteristic information may be extracted from the searched log data, separated into reference time units, and stored. Here, the reference time may be 1 hour or 1 day.

FIG. 5 is a diagram showing an example of characteristic information separated on a daily basis in step S220 of the malicious code infection detection method through network communication log analysis according to an embodiment of the present invention. As shown in FIG. 5, in step S220 of the malicious code infection detection method through network communication log analysis according to an embodiment of the present invention, characteristic information composed of the communication date and time, the source IP, and the destination domain may be extracted from the log data queried for one week (preset period), and the extracted characteristic information may be separated in units of one day (reference time) and stored for each access date. At this time, the access URL can be extracted without its parameters. That is, the detailed address can be removed from the access URL so that only the main address, such as “www.attackter.com”, “www.naver.com”, and “normal_update.com”, is extracted as the destination domain.
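The extraction and daily separation in step S220 can be sketched with the Python standard library. The helper names `main_address` and `split_by_day` and the `(timestamp, src_ip, url)` tuple layout are assumptions for illustration; the source does not state how the main address is parsed.

```python
from collections import defaultdict
from datetime import datetime
from urllib.parse import urlsplit


def main_address(url: str) -> str:
    """Strip path and parameters, keeping only the host part of an access URL."""
    # urlsplit only fills in the host when a "//" separator is present,
    # so bare hosts such as "www.naver.com/a?b=1" get a "//" prefix first.
    parts = urlsplit(url if "//" in url else "//" + url)
    return parts.hostname or url


def split_by_day(records):
    """Group (timestamp, src_ip, url) records by access date."""
    by_day = defaultdict(list)
    for ts, src_ip, url in records:
        by_day[ts.date()].append((src_ip, main_address(url)))
    return dict(by_day)
```

A destination IP passes through `main_address` unchanged, so firewall and proxy records can share the same grouping step.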

In step S230, the characteristic information separated and stored in units of the reference time may be refined by removing duplicates. More specifically, for the characteristic information separated by access date as shown in FIG. 5, duplicate data may be removed based on the tuple (access date, source IP, destination domain).

FIG. 6 is a diagram showing an example of characteristic information refined by removing duplicates in step S230 of the method for detecting malicious code infection through network communication log analysis according to an embodiment of the present invention. That is, in step S230 of the malicious code infection detection method through network communication log analysis according to an embodiment of the present invention, duplicate data is removed from the characteristic information for each access date so that it can be refined as shown in FIG. 6. Here, log data collected from network devices to which a total of five user devices (source IPs 1.1.1.1 to 5.5.5.5) are connected was queried for the week from January 1, 2021 to January 7, 2021, and the characteristic information was extracted, separated on a daily basis, and deduplicated.
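The duplicate removal in step S230 amounts to keeping one row per (access date, source IP, destination domain) tuple. The following is a minimal sketch under the assumption that each row is already such a tuple; the function name `deduplicate` is hypothetical.

```python
def deduplicate(rows):
    """Remove duplicates keyed on (access date, source IP, destination domain), keeping first-seen order."""
    # dict.fromkeys preserves insertion order (Python 3.7+) and drops repeated keys.
    return list(dict.fromkeys(rows))
```

After this step, a device that contacted the same destination many times in one day contributes a single row for that day, which is what the later day-count relies on.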

In step S300, the analysis module 130 analyzes the characteristic information extracted in step S200 to identify the IP of the user device estimated to have accessed the attack command server. That is, the analysis module 130 may identify a user device infected with malicious code by analyzing the refined data shown in FIG. 6. In particular, in the method for detecting malicious code infection through network communication log analysis according to an embodiment of the present invention, an analysis technique developed in-house is applied so that hosts infected with malicious code can be quickly and accurately identified among user PCs that periodically attempt to access external networks.

FIG. 7 is a diagram showing a detailed flow of step S300 in the method for detecting malicious code infection through network communication log analysis according to an embodiment of the present invention. As shown in FIG. 7, step S300 of the malicious code infection detection method through network communication log analysis according to an embodiment of the present invention is performed in the analysis module 130 and can be implemented including calculating, for each destination domain, the total number of connections to the destination domain and the number of source IPs from the characteristic information stored in units of the reference time (S310), counting the number of days on which each source IP accessed the destination domain within the preset period and identifying the source IPs that accessed the destination domain for more than a threshold number of days (S320), and calculating the risk level, determining whether the access is malicious, and identifying the source IP determined to be malicious (S330).

In step S310, the total number of connections accessing the destination domain and the number of source IPs may be calculated based on the destination domain from characteristic information stored in units of reference time.

FIG. 8 is a diagram showing an example of the total number of connections to the destination domain and the number of source IPs calculated in step S310 of the method for detecting malicious code infection through network communication log analysis according to an embodiment of the present invention. As shown in FIG. 8, in step S310 of the malicious code infection detection method through network communication log analysis according to an embodiment of the present invention, the total number of accesses to each destination domain and the number of source IPs that connected to that destination domain can be calculated per day, which is the reference time.
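The per-day counts of step S310 can be sketched as a single pass over the stored rows. The function name `per_domain_stats` and the `(access date, source IP, destination domain)` row layout are assumptions; the sketch counts every input row as one connection, so whether duplicates have been removed beforehand changes the connection totals.

```python
from collections import defaultdict


def per_domain_stats(rows):
    """For each (access date, destination domain), return (total connections, distinct source IPs)."""
    conn_cnt = defaultdict(int)
    src_ips = defaultdict(set)
    for day, src_ip, dst_domain in rows:
        key = (day, dst_domain)
        conn_cnt[key] += 1          # every row counts as one connection
        src_ips[key].add(src_ip)    # a set keeps each source IP once
    return {key: (conn_cnt[key], len(src_ips[key])) for key in conn_cnt}
```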

In step S320, the number of days accessing the destination domain for each source IP within a preset period may be counted, and source IPs accessing the destination domain for more than a threshold number of days may be identified.

FIG. 9 is a diagram showing an example of the number of access days counted in step S320 of the method for detecting malicious code infection through network communication log analysis according to an embodiment of the present invention. As shown in FIG. 9, in step S320 of the malicious code infection detection method through network communication log analysis according to an embodiment of the present invention, the number of days on which each source IP accessed the destination domain during the last 7 days (preset period) can be counted (“total by date” in FIG. 9), and the source IPs that accessed the corresponding destination domain for more than 5 days (threshold number of days) can be identified.
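The day count of step S320 can be sketched as follows. The function names and row layout are hypothetical, and the translated text does not make clear whether the threshold is inclusive; the sketch treats "5 days" as "5 days or more" (`>=`), which is an assumption.

```python
from collections import defaultdict


def access_days(rows):
    """Count the distinct access dates for each (source IP, destination domain) pair."""
    days = defaultdict(set)
    for day, src_ip, dst_domain in rows:
        days[(src_ip, dst_domain)].add(day)
    return {pair: len(d) for pair, d in days.items()}


def frequent_pairs(rows, threshold_days=5):
    """Return the pairs whose access-day count reaches the threshold (assumed inclusive)."""
    return {pair: n for pair, n in access_days(rows).items() if n >= threshold_days}
```

Counting distinct dates rather than raw connections is what separates a device that checks in every day from one that generates a burst of traffic on a single day.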

In step S330, a risk level is calculated using the results calculated in steps S310 and S320, whether the access is malicious is determined, and the source IP determined to be malicious can be identified. More specifically, the risk level calculated in step S330 is calculated for each destination domain as the product of (i) the number of access days to the destination domain within the set period divided by the number of source IPs that accessed the destination domain within the set period, and (ii) the average total number of connections to all destination domains divided by the average number of access days to all destination domains within the set period.

Expressed as a formula, cnc_risk_score = (avg_date_cnt / total_src_cnt) * (avg_src_cnt / total_avg_date_cnt). Here, cnc_risk_score is the risk level, avg_date_cnt is the average number of access days for the destination domain, total_src_cnt is the total number of source IPs for the destination domain, avg_src_cnt is the average number of access attempts from source IPs across all sites, and total_avg_date_cnt is the average number of access days across all sites.
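The formula can be evaluated directly once its four inputs are available. The sketch below only transcribes the stated expression; how each input is aggregated from the logs is not fully specified in the text, so the input values in the usage note are hypothetical and are not the values behind FIG. 9.

```python
def cnc_risk_score(avg_date_cnt, total_src_cnt, avg_src_cnt, total_avg_date_cnt):
    """Evaluate (avg_date_cnt / total_src_cnt) * (avg_src_cnt / total_avg_date_cnt)."""
    if total_src_cnt == 0 or total_avg_date_cnt == 0:
        raise ValueError("total_src_cnt and total_avg_date_cnt must be non-zero")
    # First factor: per-domain ratio, large when few source IPs return on many days.
    # Second factor: the same for every domain, acting as a global scaling term.
    return (avg_date_cnt / total_src_cnt) * (avg_src_cnt / total_avg_date_cnt)
```

With hypothetical global terms `avg_src_cnt=10.0` and `total_avg_date_cnt=4.0`, a domain reached by one source IP on 7 days scores `cnc_risk_score(7.0, 1, 10.0, 4.0)`, which is 17.5, while a domain reached by four source IPs on 3 days on average scores 1.875. The score therefore rises for destinations that few devices visit persistently, which matches the observation that only a small number of infected PCs access the C2 server.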

The higher the risk value calculated in step S330, the higher the possibility of infection with malicious code. A threshold value that serves as the criterion for determining infection with malicious code is set in advance, and if the threshold value is exceeded, it can be determined that there is a risk of infection with malicious code. In the example shown in FIG. 9, the risk levels may be 24.29 for “www.attacker.com”, 4.81 for “www.naver.com”, and so on. When the destination domain with the highest risk level is selected, or when the threshold value is 20, the IP “1.1.1.1” of the user device that accessed “www.attacker.com”, the destination domain whose calculated risk level exceeds the threshold value, can be identified. For the user device corresponding to the identified IP “1.1.1.1”, the administrator or user can be notified of the possible malicious code infection, and measures such as restricting network use can be taken.
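The threshold decision can be sketched as a lookup from high-risk destination domains back to the source IPs that accessed them. The function name `identify_infected` and the two dictionary arguments are hypothetical; the risk values 24.29 and 4.81 and the threshold of 20 come from the example in the text, while the source IPs listed for "www.naver.com" in the usage note are made up for illustration.

```python
def identify_infected(risk_by_domain, sources_by_domain, threshold=20.0):
    """Return source IPs that accessed any destination domain whose risk exceeds the threshold."""
    infected = set()
    for domain, risk in risk_by_domain.items():
        if risk > threshold:
            infected.update(sources_by_domain.get(domain, ()))
    return sorted(infected)
```

For example, with `risk_by_domain = {"www.attacker.com": 24.29, "www.naver.com": 4.81}` and `sources_by_domain = {"www.attacker.com": ["1.1.1.1"], "www.naver.com": ["2.2.2.2", "3.3.3.3"]}`, the function returns `["1.1.1.1"]`, the IP that the text identifies for notification or network restriction.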

As described above, according to the malicious code infection detection system 100 and method through network communication log analysis proposed in the present invention, log data collected from network devices is analyzed to identify periodic access to the attack command server by user devices infected with malicious code, so that a host infected with malicious code can be quickly and accurately identified among user devices that periodically attempt to access an external network.

Meanwhile, the present invention may include a computer-readable medium including program instructions for performing operations implemented in various communication terminals. For example, computer-readable media may include magnetic media such as hard disks, floppy disks, and magnetic tapes; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory.

Such computer-readable media may include program instructions, data files, data structures, etc. alone or in combination. At this time, program instructions recorded on a computer-readable medium may be specially designed and configured to implement the present invention, or may be known and usable to those skilled in computer software. For example, it may include high-level language codes that can be executed by a computer using an interpreter, as well as machine language codes generated by a compiler.

The present invention described above can be variously modified or applied by those skilled in the art to which the present invention belongs, and the scope of the technical idea according to the present invention should be defined by the claims below.

Claims

As a malicious code infection detection system 100,

Identify user devices infected with malicious code by recognizing that user devices infected with malicious code periodically access the attack command server (C&C server, Command and Control Server),

a log collection module 110 that collects log data from a network device connected to the user device;

a feature extraction module 120 for extracting feature information consisting of a communication date and time, a source IP, and a destination domain from the log data collected by the log collection module 110; and

an analysis module 130 that analyzes the characteristic information extracted by the feature extraction module 120 to identify the IP of the user device estimated to have accessed the attack command server, the system being a malicious code infection detection system (100) through network communication log analysis.
The system of claim 1, wherein the log collection module 110,

Collecting the log data from at least one network device of a firewall including a source IP and a destination IP, a web proxy including a source IP and a connection URL, and a packet forensic solution;

The destination domain constituting the characteristic information is the destination IP or the access URL, the system being a malicious code infection detection system (100) through network communication log analysis.
The system of claim 1, wherein the feature extraction module 120,

(2-1) querying the log data for a preset period of time;

(2-2) extracting the characteristic information from the searched log data, separating and storing the characteristic information in units of reference time; and

(2-3) refining the characteristic information separated and stored in units of the reference time by removing duplicates, thereby extracting and refining the characteristic information from the log data, the system being a malicious code infection detection system (100) through network communication log analysis.
The system of claim 3, wherein the analysis module 130,

(3-1) calculating the total number of connections accessing the destination domain and the number of source IPs based on the destination domain from the characteristic information stored in units of the reference time;

(3-2) counting the number of access days to the destination domain for each source IP within the preset period, and identifying the source IP that accesses the destination domain for more than a threshold number of days; and

(3-3) calculating a risk level using the results calculated in steps (3-1) and (3-2), determining whether the access is malicious, and identifying the source IP determined to be malicious, whereby a user device infected with malicious code is identified, the system being a malicious code infection detection system (100) through network communication log analysis.
The system of claim 4, wherein the risk level is,

calculated for each destination domain as the product of (i) the number of access days to the destination domain within the set period divided by the number of source IPs accessing the destination domain within the set period, and (ii) the average total number of connections to all destination domains divided by the average number of access days to all destination domains within the set period, the system being a malicious code infection detection system (100) through network communication log analysis.
A malicious code infection detection method in which each step is performed in the malicious code infection detection system 100,

Identify user devices infected with malicious code by recognizing that user devices infected with malicious code periodically access the attack command server (C&C server, Command and Control Server),

(1) collecting log data from a network device connected to the user device;

(2) extracting characteristic information consisting of communication date and time, source IP, and destination domain from the log data collected in step (1); and

(3) analyzing the characteristic information extracted in step (2) to identify the IP of the user device that is presumed to have accessed the attack command server, the method being a malicious code infection detection method through network communication log analysis.
The method of claim 6, wherein the step (2),

(2-1) querying the log data for a preset period of time;

(2-2) extracting the characteristic information from the searched log data, separating and storing the characteristic information in units of reference time; and

(2-3) refining the characteristic information separated and stored in units of the reference time by removing duplicates, the method being a malicious code infection detection method through network communication log analysis.
The method of claim 7, wherein the step (3),

(3-1) calculating the total number of connections accessing the destination domain and the number of source IPs based on the destination domain from the characteristic information stored in units of the reference time;

(3-2) counting the number of access days to the destination domain for each source IP within the preset period, and identifying the source IP that accesses the destination domain for more than a threshold number of days; and

(3-3) calculating a risk level using the results calculated in steps (3-1) and (3-2), determining whether the access is malicious, and identifying the source IP determined to be malicious, the method being a malicious code infection detection method through network communication log analysis.