WO2022191596A1

WO2022191596A1 - Device and method for automatically detecting abnormal behavior of network packet on basis of auto-profiling

Info

Publication number: WO2022191596A1
Application number: PCT/KR2022/003293
Authority: WO
Inventors: 조홍연
Original assignee: 주식회사 씨티아이랩
Priority date: 2021-03-11
Filing date: 2022-03-08
Publication date: 2022-09-15
Also published as: KR20220127757A

Abstract

A method for automatically detecting abnormal behavior of a network packet on the basis of auto-profiling, according to the present invention, comprises the steps of: collecting and storing network packet log data to be model-learned; training an auto-profiling model, a plurality of threat type classification models and an abnormal behavior detection model by using the collected and stored network packet log data, the plurality of threat type classification models being trained with a plurality of different threat types; inputting, into the auto-profiling model, network packet log data to be predicted, so as to primarily classify whether the log data corresponds to normal or a specific threat type, the specific threat type being one of the plurality of threat types; if the network packet log data to be predicted is primarily classified as a specific threat type by the auto-profiling model, inputting the network packet log data to be predicted into the threat type classification model corresponding to the specific threat type from among the plurality of threat type classification models, so as to finally detect whether a specific attack is present; and if the network packet log data to be predicted is primarily classified as normal by the auto-profiling model, inputting the network packet log data to be predicted into the abnormal behavior detection model, so as to finally detect whether normal or abnormal behavior is present.

Description

Apparatus and method for automatic detection of network packet anomalies based on auto-profiling

The present invention relates to an apparatus and method for automatically detecting an anomaly in a network packet based on auto-profiling.

As the network systems of companies and institutions become more complex and the amount of data they handle increases, existing technologies for detecting abnormal network packet behavior and cyber threats suffer from limited traffic analysis processing speed and detection performance. Artificial intelligence models are used to detect abnormal behavior in such network systems, but further improvement in traffic analysis processing speed and detection performance is still required.

Accordingly, the technical problem to be solved by the present invention is to provide an apparatus and method for automatically detecting an anomaly in a network packet based on auto-profiling.

According to the present invention for solving the above technical problem, a computer-implemented method for automatically detecting abnormal behavior of a network packet based on auto-profiling includes the steps of: collecting and storing network packet log data for model learning; training an auto-profiling model, a plurality of threat type classification models, and an anomaly detection model using the collected and stored network packet log data, the plurality of threat type classification models being trained on a plurality of different threat types, respectively; inputting prediction target network packet log data into the auto-profiling model to primarily classify whether it corresponds to normal or a specific threat type, the specific threat type being one of the plurality of threat types; when the prediction target network packet log data is primarily classified as a specific threat type by the auto-profiling model, inputting the prediction target network packet log data into the threat type classification model corresponding to the specific threat type among the plurality of threat type classification models to finally detect whether a specific attack is present; and when the prediction target network packet log data is primarily classified as normal by the auto-profiling model, inputting the prediction target network packet log data into the anomaly detection model to finally detect whether the behavior is normal or abnormal.

The auto-profiling model, the plurality of threat type classification models, and the anomaly detection model may be trained using the model learning target network packet log data pre-processed by a predetermined pre-processing method.

The prediction target network packet log data may be pre-processed by the predetermined pre-processing method and then input to the auto-profiling model, a plurality of threat type classification models, and an anomaly detection model.

The predetermined pre-processing method includes the steps of: grouping network packet log data having the same source IP (src_ip) and destination IP (dst_ip) for each predetermined unit time; obtaining statistical values for attributes corresponding to numeric data among the attributes included in the grouped network packet log data; combining character string data among the attributes included in the grouped network packet log data into one character string data; and generating preprocessed network packet log data including the obtained statistical values and the character string data combined into one.

The anomaly detection model may be an auto-encoder, IF (Isolation Forest), LOF (Local Outlier Factor), ESoiNN (enhanced self-organizing incremental neural network), or SOM (Self-Organizing Maps).

According to the present invention for solving the above technical problem, an apparatus for automatically detecting abnormal behavior of a network packet based on auto-profiling includes: a data storage unit that collects and stores network packet log data for model learning; a learning unit that trains an auto-profiling model, a plurality of threat type classification models, and an anomaly detection model using the collected and stored network packet log data, the plurality of threat type classification models being trained on a plurality of different threat types, respectively; and a real-time prediction classification unit that, when prediction target network packet log data is primarily classified as a specific threat type by the auto-profiling model, inputs the prediction target network packet log data into the threat type classification model corresponding to the specific threat type among the plurality of threat type classification models to finally detect whether a specific attack is present, and, when the prediction target network packet log data is primarily classified as normal by the auto-profiling model, inputs the prediction target network packet log data into the anomaly detection model to finally detect whether the behavior is normal or abnormal.

A computer-readable recording medium according to an embodiment of the present invention for solving the above technical problem may record a program for executing the method in a computer.

According to the present invention, it is possible to efficiently apply large-capacity data, which is difficult to analyze, to preprocessing and models. In addition, it is possible to reduce the size of large-scale data by defining the characteristics of each type of cyber threat through auto-profiling and applying the primary data classification through the characteristics of each type of defined cyber threat. It can also improve the prediction speed of deep learning models due to reduced data. In addition, it is possible to improve the model classification performance by applying the deep learning model after the threshold-based primary classification.

FIG. 1 is a block diagram of an auto-profiling-based network packet anomaly automatic detection system according to an embodiment of the present invention.

FIG. 2 is a diagram illustrating a network packet data collection and storage process according to the present invention.

FIG. 3 is a diagram illustrating a network packet data preprocessing process according to the present invention.

FIG. 4 shows an example in which network packet log data according to the present invention is aggregated and pre-processed to be expressed as data in one row.

FIG. 5 is a diagram illustrating a data standardization method according to the present invention.

FIG. 6 is a diagram illustrating a model learning process according to the present invention.

FIG. 7 is a diagram illustrating a real-time threat classification prediction process according to the present invention.

Then, with reference to the accompanying drawings, embodiments of the present invention will be described in detail so that those of ordinary skill in the art to which the present invention pertains can easily implement them.

FIG. 1 is a block diagram of an auto-profiling-based network packet anomaly automatic detection system 100 according to an embodiment of the present invention.

Referring to FIG. 1, the system according to the present invention may include a data storage unit 110, a data preprocessing unit 120, a learning unit 130, and a real-time prediction classification unit 140.

The data storage unit 110 may collect and store a large amount of network packet log data.

The data storage unit 110 may process the network packet log data in real time as queue messages, parse the queue-processed data as a data stream, and extract and store from the parsed data only the elements necessary for training the auto-profiling model, the plurality of threat type classification models, and the anomaly detection model, which are described later.

The data storage unit 110 may store various types of information and data related to the operation of the system 100.

The data pre-processing unit 120 may pre-process the network packet log data stored in the data storage unit 110 and the network packet log data collected in real time according to the purpose and output the pre-processed data.

The data preprocessor 120 may aggregate the collected network packet log data by time and feature, group data for the character string, and perform scaling for each column after vectorizing the character string data.

The data preprocessor 120 may perform data labeling preprocessing for learning on the collected network packet log data.

The learning unit 130 may train an auto-profiling model, a plurality of threat type classification models, and an abnormal behavior detection model using data obtained by preprocessing network packet log data. The plurality of threat type classification models may be trained on a plurality of different threat types, respectively. For example, assuming that there are n threat types, n threat type classification models can be trained.

The auto-profiling model may be implemented with a Decision Tree, ESoiNN, SOM, or the like.

The plurality of threat type classification models may be trained by supervised learning using CNN, LSTM, RNN, DNN, Random Forest, and the like.

The anomaly detection model may be trained by unsupervised learning using an autoencoder, an isolation forest (IF), a local outlier factor (LOF), an enhanced self-organizing incremental neural network (ESoinn), or self-organizing maps (SOM).

The real-time prediction classification unit 140 may receive network packet log data collected in real time and detect normal behavior, threats, and abnormal behavior using the auto-profiling model, the plurality of threat type classification models, and the anomaly detection model trained by the learning unit 130.

Specifically, the real-time prediction classification unit 140 may input the network packet log data collected in real time and pre-processed by the data pre-processing unit 120 into the pre-trained auto-profiling model to primarily classify whether it corresponds to normal or a specific threat type. In addition, the real-time prediction classification unit 140 may finally detect whether a specific threat exists by inputting data classified as a specific threat type into the threat type classification model corresponding to the threat type primarily classified by the auto-profiling model. Meanwhile, the real-time prediction classification unit 140 may input the network packet log data classified as normal by the auto-profiling model into the pre-trained abnormal behavior detection model to finally detect whether the behavior is normal or abnormal.

Referring to FIG. 2, the data storage unit 110 receives and collects large-capacity network packet log data, processes the large-capacity data in real time as queue messages (S210), and parses the queue-processed data as a data stream (S220). The data storage unit 110 reads and extracts from the parsed data only the predetermined elements necessary for the auto-profiling model, the plurality of threat type classification models, and the anomaly detection model (S230), and inputs them to the data storage (DB) (S240). For example, attributes of HTTP traffic such as http agent, http query, http_host, http_method, http_retcode, http_path, dst_port, ingress_if, traffic_bytes, and traffic_packets can be extracted. Elements other than those exemplified here may also be extracted from the network packet log data.
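The collection flow of steps S210 to S240 can be illustrated with a minimal sketch. It assumes that each queue message is a JSON object and uses an in-memory list as a stand-in for the data storage (DB); the message layout, the function names, and the `tcp_flags` field are hypothetical and are not specified in this document.

```python
import json
import queue

# Attribute names follow the examples in the text; the JSON message layout
# and the in-memory "database" list are assumptions made for illustration.
KEEP = ("src_ip", "dst_ip", "http_host", "http_method", "http_retcode",
        "http_path", "dst_port", "ingress_if", "traffic_bytes",
        "traffic_packets")


def parse_message(raw):
    """S220: parse one queue message (assumed to be a JSON object)."""
    return json.loads(raw)


def extract_elements(record):
    """S230: keep only the predetermined elements the models need."""
    return {key: record[key] for key in KEEP if key in record}


def drain(messages, database):
    """S210 to S240: consume queued messages, parse, extract, and store."""
    while True:
        try:
            raw = messages.get_nowait()
        except queue.Empty:
            break
        database.append(extract_elements(parse_message(raw)))


messages = queue.Queue()
messages.put(json.dumps({"src_ip": "10.0.0.1", "dst_ip": "10.0.0.9",
                         "http_method": "GET", "traffic_bytes": 512,
                         "tcp_flags": "S"}))  # tcp_flags is not in KEEP
database = []
drain(messages, database)
```

In this sketch the `tcp_flags` field is dropped at the extraction step, so only the elements listed in `KEEP` reach the stand-in database.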

FIG. 3 is a diagram illustrating a network packet log data preprocessing process according to the present invention.

Referring to FIG. 3, the data preprocessor 120 may preprocess network packet log data parsed by the data storage unit 110 and stored in the data storage DB.

The data preprocessor 120 may group the network packet log data having the same source IP (src_ip) and destination IP (dst_ip) for each predetermined unit time and preprocess it to be expressed as data in one row.

First, the data preprocessor 120 may group network packet log data having the same source IP (src_ip) and destination IP (dst_ip) into one group for each predetermined unit time (S310).

Next, the data preprocessor 120 may obtain a statistical value for each attribute corresponding to numeric data among the attributes included in the network packet log data of the same group (S320). The statistical value may be at least one of mean, variation, and variance. For example, the average number of traffic packets, the average traffic bytes, and the average number of methods can be obtained.

In addition, the data preprocessor 120 may combine character string data into one character string data among the attributes included in the network packet log data belonging to the same group (S330). String data combined into one can also be expressed by separating string data included in each network packet log data with a delimiter such as a comma (,).
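Steps S310 to S330 can be sketched as follows. This is a minimal illustration, not the implementation of the embodiment: the `ts` field (epoch seconds), the 60-second unit time, the choice of mean and population variance as the statistics, and the function names are all assumptions.

```python
from collections import defaultdict
from statistics import mean, pvariance

UNIT_SECONDS = 60  # assumed unit time; the text only says "predetermined"


def group_logs(logs, unit_seconds=UNIT_SECONDS):
    """S310: group logs sharing src_ip and dst_ip within one time window."""
    groups = defaultdict(list)
    for log in logs:
        window = log["ts"] // unit_seconds  # "ts" is a hypothetical field
        groups[(log["src_ip"], log["dst_ip"], window)].append(log)
    return groups


def summarize(group, numeric=("traffic_bytes", "traffic_packets"),
              strings=("http_method", "http_path")):
    """S320 and S330: statistics for numeric attributes, joined strings."""
    row = {}
    for name in numeric:
        values = [log[name] for log in group]
        row[name + "_mean"] = mean(values)
        row[name + "_var"] = pvariance(values)
    for name in strings:
        # Comma-delimited combination, as described for step S330.
        row[name] = ",".join(log[name] for log in group)
    return row


base = {"src_ip": "10.0.0.1", "dst_ip": "10.0.0.9"}
logs = [
    dict(base, ts=0, traffic_bytes=100, traffic_packets=2,
         http_method="GET", http_path="/a"),
    dict(base, ts=30, traffic_bytes=300, traffic_packets=4,
         http_method="POST", http_path="/b"),
    dict(base, ts=90, traffic_bytes=500, traffic_packets=6,
         http_method="GET", http_path="/c"),
]
rows = {key: summarize(group) for key, group in group_logs(logs).items()}
```

The first two logs fall into the same window and collapse into one row, while the third log starts a new window and produces a second row.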

In addition, the data preprocessor 120 may perform normalization on the statistical values and character string data obtained in steps S320 and S330. Statistical values obtained from numeric data can be scaled by using one of various scaling techniques such as Standard scaling and MinMax Scaling as illustrated in FIG. 5(a). Meanwhile, string data may be converted into a numeric vector form and scaled by applying the standardization method of numeric data described above.
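The two scaling techniques named above can be sketched in a few lines. This is a plain-Python illustration under the assumption that each column is scaled independently; the function names are hypothetical, and returning zeros for a constant column is a design choice of this sketch rather than something stated in the document.

```python
from statistics import mean, pstdev


def standard_scale(values):
    """Standard scaling: zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # constant column: assumed convention
    return [(v - mu) / sigma for v in values]


def min_max_scale(values):
    """MinMax scaling: map the column linearly onto the range [0, 1]."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]  # constant column: assumed convention
    return [(v - low) / (high - low) for v in values]


column = [100.0, 200.0, 300.0]
standardized = standard_scale(column)
normalized = min_max_scale(column)
```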

To convert string data into a numeric vector form, techniques such as TF-IDF (Term Frequency - Inverse Document Frequency), Count Vectorization, and word2vec may be used. For example, TF-IDF as exemplified in FIG. 5(b) analyzes the character strings and assigns each one a weight according to its frequency. After the log data is converted into numbers, the resulting vectors can be combined into a matrix.

Thereafter, the data preprocessor 120 may generate and output preprocessed network packet log data including the previously obtained statistical values and the combined string data (S340).
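As an illustration of string vectorization, the following sketch computes TF-IDF over comma-joined string data. It assumes one common textbook variant (term frequency as count divided by document length, inverse document frequency as log(N / df)); the document does not state which variant is used, and the function name is hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return (vocabulary, matrix) with one TF-IDF row per document.

    tf = term count / document length; idf = log(N / document frequency).
    This variant is an assumption chosen for illustration.
    """
    tokenized = [doc.split(",") for doc in documents]
    vocabulary = sorted({tok for doc in tokenized for tok in doc})
    n_docs = len(tokenized)
    doc_freq = Counter(tok for doc in tokenized for tok in set(doc))
    matrix = []
    for doc in tokenized:
        counts = Counter(doc)
        matrix.append([
            (counts[tok] / len(doc)) * math.log(n_docs / doc_freq[tok])
            for tok in vocabulary
        ])
    return vocabulary, matrix


# Each "document" is the combined string data of one grouped row.
vocabulary, matrix = tf_idf(["GET,GET,POST", "GET,PUT"])
```

With this variant, a token that appears in every row (here `GET`) receives a weight of zero, so only tokens that distinguish one row from another contribute to the vector.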

It is also possible to convert the network packet log data into the input value form required by the AI-based model in the preprocessor 120 by an appropriate method other than the method exemplified here. Various known methods for converting network packet log data into the form of input values required by the AI-based model may be used.

The real-time prediction target network packet log data goes directly through steps S310 to S340, whereas data labeling is first performed on the model training target network packet log data before steps S310 to S340. The model training target network packet log data can be obtained from firewalls and security equipment already labeled by that equipment, or it can be labeled from the network log based on a ruleset.
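Ruleset-based labeling can be sketched as below. The rules, the label names, and the function name are hypothetical and serve only to show the mechanism; the document does not describe the contents of any ruleset.

```python
# Hypothetical ruleset: each rule is a (label, predicate) pair. The rule
# contents are illustrative only and do not come from this document.
RULES = [
    ("sql_injection",
     lambda log: "union select" in log.get("http_path", "").lower()),
    ("path_traversal",
     lambda log: "../" in log.get("http_path", "")),
]


def label_log(log, rules=RULES):
    """Return the first matching threat label, or 'normal' if none match."""
    for label, predicate in rules:
        if predicate(log):
            return label
    return "normal"


labeled = [
    (log, label_log(log))
    for log in (
        {"http_path": "/index.html"},
        {"http_path": "/../../etc/passwd"},
        {"http_path": "/item?id=1 UNION SELECT 1"},
    )
]
```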

Referring to FIG. 6, the learning unit 130 receives the pre-processed model training data from the data pre-processing unit 120 and trains the auto-profiling model, the plurality of threat type classification models, and the anomaly detection model.

Specifically, the auto-profiling model learns both normal data and threat type data after pre-processing, so that it is trained to primarily classify, when new data is input, whether the data is normal or corresponds to a specific threat type (S610).

The anomaly detection model may use an autoencoder, IF (Isolation Forest), LOF (Local Outlier Factor), ESoiNN (enhanced self-organizing incremental neural network), SOM (Self-Organizing Maps), or the like, and may be trained by unsupervised learning on only the normal data so that it detects abnormal behavior when abnormal data is received (S620). Therefore, when new data comes in, it is first classified by the auto-profiling model, and the data determined to be normal by the auto-profiling model is then input to the anomaly detection model. This makes it possible to detect abnormal behavior that does not fall under any threat type but is nevertheless not normal.
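The idea of learning only normal data and flagging whatever deviates from it can be shown with a deliberately simple stand-in. The sketch below is not an autoencoder, IF, LOF, ESoiNN, or SOM; it is a nearest-neighbour distance detector swapped in for illustration, and the class name and the `margin` parameter are assumptions.

```python
import math


class NearestNeighborNovelty:
    """Stand-in detector: flags rows far from every normal training row."""

    def fit(self, normal_rows, margin=1.5):
        # Requires at least two normal rows.
        self.rows = [tuple(r) for r in normal_rows]
        # Threshold: largest leave-one-out nearest-neighbour distance seen
        # inside the normal data, widened by an assumed margin.
        loo = [
            min(math.dist(a, b) for j, b in enumerate(self.rows) if j != i)
            for i, a in enumerate(self.rows)
        ]
        self.threshold = margin * max(loo)
        return self

    def is_abnormal(self, row):
        nearest = min(math.dist(row, b) for b in self.rows)
        return nearest > self.threshold


# Training uses normal rows only; no abnormal examples are needed.
normal = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1)]
detector = NearestNeighborNovelty().fit(normal)
```

A row close to the normal cluster, such as `(0.05, 0.05)`, is accepted, while a distant row such as `(1.0, 1.0)` is flagged as abnormal.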

Each of the plurality of threat type classification models may be trained to detect a specific threat by learning only the network packet log data labeled with the specific threat type corresponding to it (S630). Therefore, when new network packet log data arrives in real time, it is first classified by the auto-profiling model and then input into the threat type classification model corresponding to the threat type assigned by the auto-profiling model, which makes final threat detection possible.

As mentioned above, normal data means data that is not a threat; threat data means data that can be classified under a specific threat name because the characteristics of the threat are clear; and abnormal data means data that does not correspond to a threat and is therefore classified as normal, but that has characteristics different from the pattern of normal data.

The learning unit 130 may store the models trained in steps S610, S620, and S630 (S640).
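The one-model-per-threat-type arrangement of step S630 can be sketched as follows. The `CentroidModel` class is a toy stand-in for the CNN, LSTM, RNN, DNN, or Random Forest models named earlier, and the threat names `ddos` and `scan` are hypothetical labels chosen for the example.

```python
import math
from collections import defaultdict


class CentroidModel:
    """Toy stand-in for a real classifier: remembers one threat's region."""

    def fit(self, rows):
        dims = len(rows[0])
        self.center = tuple(
            sum(r[d] for r in rows) / len(rows) for d in range(dims)
        )
        self.radius = max(math.dist(r, self.center) for r in rows)
        return self

    def detects(self, row):
        return math.dist(row, self.center) <= self.radius


def train_threat_models(rows, labels):
    """S630: one model per threat type, each fed only its own labeled rows."""
    by_type = defaultdict(list)
    for row, label in zip(rows, labels):
        if label != "normal":
            by_type[label].append(row)
    return {label: CentroidModel().fit(group)
            for label, group in by_type.items()}


rows = [(0.0, 0.0), (9.0, 9.0), (10.0, 10.0), (-5.0, 5.0), (-6.0, 5.0)]
labels = ["normal", "ddos", "scan", "scan", "ddos"]
labels = ["normal", "ddos", "ddos", "scan", "scan"]  # labels used below
models = train_threat_models(rows, labels)
```

With two threat types in the labeled data, two models are produced, and each one responds only to rows resembling its own threat type.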

Referring to FIG. 7, the real-time prediction classification unit 140 may predict and classify real-time threats and abnormal behaviors using the models trained in steps S610, S620, and S630. That is, the real-time prediction classification unit 140 may receive the preprocessed data to be predicted and apply it to the models trained and stored by the learning unit 130. First, the real-time prediction classification unit 140 primarily classifies the data input in real time according to its characteristics using the auto-profiling model (S710). In step S710, the data is classified as normal or threat and, in the case of a threat, is primarily classified down to the threat type.

Next, the data classified as normal by the auto-profiling model is input to the abnormal behavior detection model to detect abnormal behavior (S720).

Meanwhile, the data classified as a threat by the auto-profiling model is input to the corresponding threat type classification model to finally detect the cyber threat (S730).
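The routing of steps S710 to S730 can be sketched as follows. The three model interfaces (plain callables), the toy thresholds, the threat name `flood`, and the `no_attack` verdict for an unconfirmed threat are assumptions of this sketch; the document does not state what is reported when a threat type classification model does not confirm the primary classification.

```python
NORMAL = "normal"


def classify(row, auto_profiler, threat_models, anomaly_detector):
    """Two-stage routing: S710 first, then either S720 or S730.

    auto_profiler(row) returns "normal" or a threat-type name;
    threat_models maps each threat-type name to a callable returning bool;
    anomaly_detector(row) returns True for abnormal behavior.
    All three interfaces are assumptions made for this sketch.
    """
    primary = auto_profiler(row)  # S710: primary classification
    if primary == NORMAL:
        # S720: rows judged normal are re-checked by the anomaly detector.
        return primary, ("abnormal" if anomaly_detector(row) else NORMAL)
    # S730: only the model for the predicted threat type is consulted.
    confirmed = threat_models[primary](row)
    return primary, ("attack" if confirmed else "no_attack")


# Toy models with made-up thresholds, standing in for trained models.
def toy_profiler(row):
    return "flood" if row["traffic_packets_mean"] > 1000 else NORMAL


toy_threat_models = {
    "flood": lambda row: row["traffic_bytes_mean"] > 1_000_000,
}


def toy_anomaly_detector(row):
    return row["traffic_bytes_mean"] > 500_000


def run(row):
    return classify(row, toy_profiler, toy_threat_models,
                    toy_anomaly_detector)
```

Because each row visits at most one second-stage model, the cost of prediction does not grow with the number of threat types, which is consistent with the speed benefit the description attributes to primary classification.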

Meanwhile, performance evaluation of the results predicted by the real-time prediction classifier 140 may also be performed (S740).
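The document does not specify which metrics are used in step S740. As one possible illustration, the sketch below computes precision and recall for a single class from lists of actual and predicted labels; the metric choice, the function name, and the sample labels are assumptions.

```python
def precision_recall(actual, predicted, positive):
    """Precision and recall for one class, given parallel label lists."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical ground-truth labels and predictions for four rows.
actual = ["normal", "abnormal", "ddos", "ddos"]
predicted = ["normal", "normal", "ddos", "normal"]
ddos_scores = precision_recall(actual, predicted, "ddos")
normal_scores = precision_recall(actual, predicted, "normal")
```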

As described above, according to the present invention, the traffic analysis processing speed and detection performance problems of existing technologies for detecting abnormal network packet behavior and cyber threats can be improved by using artificial intelligence (deep learning) technology. In addition, the technology according to the present invention starts by storing data in the DB, aggregates the data by time group, creates string data bundles, and then goes through a string data vectorization process and a numerical data scaling process. From the scaled data, auto-profiling based on Decision Tree, Isolation Forest (IF), Local Outlier Factor (LOF), enhanced self-organizing incremental neural network (ESoinn), Self-Organizing Maps (SOM), or the like extracts meaningful feature sets and learns the importance of the inherent features and the form of the data. The data is then passed through the models (for example, Local Outlier Factor (LOF), ESoinn (enhanced self-organizing incremental neural network), or SOM (Self-Organizing Maps)) to finally detect anomalies and cyber threats.

The embodiments described above may be implemented by hardware components, software components, and/or a combination of hardware components and software components. For example, the apparatus, methods, and components described in the embodiments may be implemented using one or more general-purpose or special-purpose computers, such as a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable gate array (FPGA), a programmable logic unit (PLU), a microprocessor, or any other device capable of executing and responding to instructions. The processing device may execute an operating system (OS) and one or more software applications running on the operating system. The processing device may also access, store, manipulate, process, and generate data in response to execution of the software. For convenience of understanding, a single processing device is sometimes described as being used, but one of ordinary skill in the art will recognize that the processing device may include a plurality of processing elements and/or a plurality of types of processing elements. For example, the processing device may include a plurality of processors or one processor and one controller. Other processing configurations are also possible, such as parallel processors.

The software may include a computer program, code, instructions, or a combination of one or more of these, and may configure the processing device to operate as desired or may command the processing device independently or collectively. The software and/or data may be embodied, permanently or temporarily, in any kind of machine, component, physical device, virtual equipment, or computer storage medium or device in order to be interpreted by the processing device or to provide instructions or data to the processing device. The software may be distributed over networked computer systems and stored or executed in a distributed manner. Software and data may be stored in one or more computer-readable recording media.

The method according to the embodiment may be implemented in the form of program instructions that can be executed through various computer means and recorded in a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, etc. alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the embodiment, or may be known and available to those skilled in the art of computer software. Examples of the computer-readable recording medium include magnetic media such as hard disks, floppy disks, and magnetic tapes; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, flash memory, and the like. Examples of program instructions include not only machine language codes such as those generated by a compiler, but also high-level language codes that can be executed by a computer using an interpreter or the like. The hardware devices described above may be configured to operate as one or more software modules to perform the operations of the embodiments, and vice versa.

As described above, although the embodiments have been described with reference to the limited drawings, those skilled in the art may apply various technical modifications and variations based on the above. For example, an appropriate result may be achieved even if the described techniques are performed in an order different from the described method, and/or the described components of the system, structure, apparatus, circuit, etc. are coupled or combined in a form different from the described method, or are replaced or substituted by other components or equivalents.

Claims

1. A computer-implemented method for automatically detecting abnormal behavior of a network packet based on auto-profiling, the method comprising:

collecting and storing model training target network packet log data;

training an auto-profiling model, a plurality of threat type classification models, and an anomaly detection model using the collected and stored network packet log data, the plurality of threat type classification models being trained on a plurality of different threat types, respectively;

inputting prediction target network packet log data into the auto-profiling model to primarily classify whether it corresponds to normal or a specific threat type, the specific threat type being one of the plurality of threat types;

when the prediction target network packet log data is primarily classified as a specific threat type by the auto-profiling model, inputting the prediction target network packet log data into the threat type classification model corresponding to the specific threat type among the plurality of threat type classification models to finally detect whether a specific attack is present; and

when the prediction target network packet log data is primarily classified as normal by the auto-profiling model, inputting the prediction target network packet log data into the anomaly detection model to finally detect whether the behavior is normal or abnormal.
2. The method of claim 1, wherein the auto-profiling model, the plurality of threat type classification models, and the anomaly detection model are trained using the model training target network packet log data pre-processed by a predetermined pre-processing method.
3. The method of claim 2, wherein the prediction target network packet log data is pre-processed by the predetermined pre-processing method and then input to the auto-profiling model, the plurality of threat type classification models, and the anomaly detection model.
4. The method of claim 3, wherein the predetermined pre-processing method comprises:

grouping network packet log data having the same source IP (src_ip) and destination IP (dst_ip) for each predetermined unit time;

obtaining statistical values for attributes corresponding to numeric data among the attributes included in the grouped network packet log data;

combining string data among the attributes included in the grouped network packet log data into one string data; and

generating preprocessed network packet log data including the obtained statistical values and the string data combined into one.
5. The method of claim 1, wherein the anomaly detection model is an auto-encoder, IF (Isolation Forest), LOF (Local Outlier Factor), ESoiNN (enhanced self-organizing incremental neural network), or SOM (Self-Organizing Maps).
6. A computer-readable recording medium on which a program for executing, on a computer, the method according to any one of claims 1 to 5 is recorded.
7. An apparatus for automatically detecting abnormal behavior of a network packet based on auto-profiling, the apparatus comprising:

a data storage unit that collects and stores model training target network packet log data;

a learning unit that trains an auto-profiling model, a plurality of threat type classification models, and an anomaly detection model using the collected and stored network packet log data, the plurality of threat type classification models being trained on a plurality of different threat types, respectively; and

a real-time prediction classification unit that, when prediction target network packet log data is primarily classified as a specific threat type by the auto-profiling model, inputs the prediction target network packet log data into the threat type classification model corresponding to the specific threat type among the plurality of threat type classification models to finally detect whether a specific attack is present, and, when the prediction target network packet log data is primarily classified as normal by the auto-profiling model, inputs the prediction target network packet log data into the anomaly detection model to finally detect whether the behavior is normal or abnormal.
8. The apparatus of claim 7, wherein the auto-profiling model, the plurality of threat type classification models, and the anomaly detection model are trained using the model training target network packet log data pre-processed by a predetermined pre-processing method.
9. The apparatus of claim 8, wherein the prediction target network packet log data is pre-processed by the predetermined pre-processing method and then input to the auto-profiling model, the plurality of threat type classification models, and the anomaly detection model.
10. The apparatus of claim 9, further comprising a data preprocessing unit that groups network packet log data having the same source IP (src_ip) and destination IP (dst_ip) for each predetermined unit time, obtains statistical values for attributes corresponding to numeric data among the attributes included in the grouped network packet log data, combines string data among the attributes included in the grouped network packet log data into one string data, and generates preprocessed network packet log data including the obtained statistical values and the string data combined into one.
11. The apparatus of claim 7, wherein the anomaly detection model is an auto-encoder, IF (Isolation Forest), LOF (Local Outlier Factor), ESoiNN (enhanced self-organizing incremental neural network), or SOM (Self-Organizing Maps).